Subject: ongoing strange network freezes with no error messages....
To: NetBSD Networking Technical Discussion List <tech-net@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-net
Date: 07/07/2001 02:13:46
Well, I upgraded my wee pentium-150 router to NetBSD/i386 1.5W-20010624
tonight.  This time I'd also increased NMBCLUSTERS to 32768 in hopes of
at least masking the problem, if not fixing it.

Since the last few times it got "stuck" were while I was playing a
128k MP3 stream and also doing a bunch of other NFS, FTP, etc. stuff
that kept my LAN really busy at the same time, I thought I'd try
replicating those conditions as best I could to see if the upgrade
and re-config made any difference.

Sure enough, non-local traffic came to a grinding halt not long after
I'd started my little "tests".

Once again there were no errors logged anywhere and no apparent
starvation of mbufs (in fact 'netstat -m' reported only three (3!) in
use at the time).  The only apparent clues are the dropped packets
reported by 'netstat -id'.

The only way to find out for sure what's wrong is to log in on the
console and try pinging something on the LAN to see if ENOBUFS is
reported.  (And the machine seems quite responsive given what it is....)
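
A minimal sketch of that console check, for anyone who'd rather script
it than eyeball ping's output.  It assumes Python is installed on the
router (nothing above says it is), and the function name send_probe,
the UDP discard port 9 and the injectable "sock" parameter are all
illustrative choices, not anything NetBSD provides.  It sends one UDP
datagram and reports whether the kernel refused it with ENOBUFS, which
is the same errno ping prints as "No buffer space available":

```python
import errno
import socket


def send_probe(host, port=9, payload=b"probe", sock=None):
    """Send one UDP datagram and report how the kernel reacted.

    Returns "ok" if the send was accepted, "ENOBUFS" if the kernel
    reported no buffer space, or the symbolic name of any other errno.
    The optional sock argument exists only so the function can be
    exercised without touching a real network.
    """
    own_socket = sock is None
    if own_socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
        return "ok"
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            return "ENOBUFS"
        # Any other failure: report its symbolic name if known.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        if own_socket:
            sock.close()
```

Calling send_probe("server") from the console during a freeze would be
the scripted equivalent of the manual ping; a return of "ENOBUFS" only
says the send was refused, not why the queue never drained.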

After "ifconfig rtk0 down; ping server; ifconfig rtk0 up" all's well
again!  (rtk0 is the LAN, the other two are to DSL and cable modems)

Here's what things looked like shortly afterwards:

# netstat -m
2 mbufs in use:
        1 mbufs allocated to data
        1 mbufs allocated to packet headers
0/28 mapped pages in use
80 Kbytes allocated to network (0% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

# netstat -ind
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls Drops
rtk0  1500  <Link>        00:48:54:1e:10:e6    71805     0    75638     1  4509   707
rtk0  1500  204.92.254    204.92.254.6         71805     0    75638     1  4509   707
rtk0  1500  fe80::/64     fe80::248:54ff:fe    71805     0    75638     1  4509   707
rtk1  1500  <Link>        00:50:bf:16:94:30    64734     0    55029     0    57     0
rtk1  1500  216.138.200.1 216.138.200.154      64734     0    55029     0    57     0
rtk1  1500  fe80::/64     fe80::250:bfff:fe    64734     0    55029     0    57     0
iy0   1500  <Link>        00:aa:00:cf:42:7c     9504     0    11721     0     4     0
iy0   1500  24.42.191/24  24.42.191.4           9504     0    11721     0     4     0
iy0   1500  fe80::/64     fe80::2aa:ff:fecf     9504     0    11721     0     4     0
lo0   33220 <Link>                                 5     0        5     0     0     0
lo0   33220 fe80::/64     fe80::1                  5     0        5     0     0     0
lo0   33220 ::1/128       ::1                      5     0        5     0     0     0
lo0   33220 127           127.0.0.1                5     0        5     0     0     0
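
To make the interesting numbers in that table easier to spot, here is a
rough sketch that parses 'netstat -ind' style text and flags interfaces
with non-zero drops.  The function names and the returned dictionary
layout are made up for illustration, and the parser assumes the column
order shown above (six trailing counters: Ipkts, Ierrs, Opkts, Oerrs,
Colls, Drops); other netstat versions may differ.

```python
FIELDS = ("ipkts", "ierrs", "opkts", "oerrs", "colls", "drops")


def parse_netstat_ind(text):
    """Return {interface: {counter: value}} from 'netstat -ind' text.

    The counters repeat on every address row of an interface, so only
    the first row seen for each interface name is kept.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 9:
            continue  # blank line or something that is not a data row
        try:
            counters = [int(value) for value in parts[-6:]]
        except ValueError:
            continue  # header row, or a row with non-numeric counters
        stats.setdefault(parts[0], dict(zip(FIELDS, counters)))
    return stats


def interfaces_with_drops(stats):
    """Return sorted names of interfaces whose drop counter is non-zero."""
    return sorted(name for name, c in stats.items() if c["drops"] > 0)


def collision_percent(counters):
    """Collisions as a percentage of output packets (0.0 if none sent)."""
    if counters["opkts"] == 0:
        return 0.0
    return 100.0 * counters["colls"] / counters["opkts"]
```

Fed the table above, this should single out rtk0 (707 drops, with
collisions at roughly 6% of output packets) and leave rtk1, iy0 and lo0
alone.  It only surfaces the counters; it says nothing about why the
packets were dropped.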


Now interestingly enough this had happened a couple or six times earlier
today before I did the upgrade.  The last time I finally got fed up and
just left a ping running on the console.  Despite beating ever harder on
the LAN and the router connections for the rest of the day, no freezes
happened.  It's as if the running ping kept things flowing despite
whatever condition apparently triggers the freeze.

Is there anything that'll tell me why the dropped packets were dropped
(i.e. what condition prevented their transmission)?  Are they simply due
to the collisions?  Should I plug the router into my last spare switch
port and see if that changes anything?

Why doesn't the system recover on its own?  I haven't waited forever,
but at least once I remember not noticing the problem for about 20
minutes.  Once things freeze up like this all traffic backs off from
what I can see of the blinking lights.  I'd think that would free up
enough of whatever to get things rolling again, but the only fix seems
to be to actually down the LAN interface.

I've got some trusty old 21041 PCI cards sitting idle at the moment (and
I've noticed they're still about the fastest 10mbit cards ever, beating
even the Intel fxp's on a much much faster machine!).  Should I swap
them into the router and see if that changes anything?

BTW, I've got my switch and the managed hub the router's connected to
both generating SNMP traps when anything goes wildly wrong on the LAN
from their perspectives (and I do get traps even if I pull a connector),
but there's been not a peep from either.

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>     <woods@robohack.ca>
Planix, Inc. <woods@planix.com>;   Secrets of the Weird <woods@weird.com>