Subject: Really bad network problem. High-speed ethernet, netbsd-current
To: None <current-users@NetBSD.ORG>
From: Mika Nystroem <>
List: current-users
Date: 03/31/1996 04:40:36
Hi everyone, 
   My cluster of machines is making progress. Now we've got six machines 
set up with one fileserver. Up until now, the operating system has been
unbelievably stable, with only a few crashes now and then, usually when
user "root" does something strange. 
   Things have changed. I have the systems set up on their own subnet
and sometimes the machines will spontaneously reboot. Usually it seems
like I do something "on the network," though that can be as little as
Ctrl-C'ing a tcpdump or exiting X (thereby killing some telnets). Then
the system crashes. Sometimes it's just one machine, sometimes the machines
at both ends of the connection, and sometimes ALL THE MACHINES ON THE
SUBNET.
   All the systems are running NetBSD-current as of March 17 (NetBSD 1.1A);
I have tried NetBSD 1.1B, but the same thing happens. (The reason I went
back to 1.1A is that I don't have a full OS setup for 1.1B...)

Here is what comes up on the consoles:

(I have to read this quickly before the crash dump is done, so there may be
some errors.)

On the server: (apparently always in the same place)
Trap type 6 code 0 
eip f818ecee cs 8 eflags 10202 cr2 0 cpl c0000000

(possibly relevant kernel symbols)
f818e5f4 T _spp_reass
f818eca0 T _spp_ctlinput
f818ed84 T _spp_quench
f818ed9c T _spp_output
f822560c B _tcpstat
f82256dc B _end
ffc00000 A _APTmap
fffff000 A _APTD

On one of the clients:
Trap type 6 code 0
eip f812982b   cs 8 eflags 10082 cr2 18 cpl 10

(see above)
f8129800 T _vn_bwrite
f8129810 T _bdwrite
f8129878 T _bawrite
f81298ac T _brelse
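
For what it's worth, both faulting eips can be placed by taking the symbol
with the greatest address not exceeding the eip. A quick sketch (Python is
used purely for the arithmetic; the addresses are the ones quoted above):

```python
# Map a faulting eip to the containing kernel symbol: the text symbol with
# the greatest address that is still <= eip.  Addresses are taken from the
# symbol lists quoted above.
from bisect import bisect_right

server_syms = [
    (0xf818e5f4, "_spp_reass"),
    (0xf818eca0, "_spp_ctlinput"),
    (0xf818ed84, "_spp_quench"),
    (0xf818ed9c, "_spp_output"),
]

client_syms = [
    (0xf8129800, "_vn_bwrite"),
    (0xf8129810, "_bdwrite"),
    (0xf8129878, "_bawrite"),
    (0xf81298ac, "_brelse"),
]

def locate(eip, syms):
    """Return (symbol, offset) for the symbol containing eip."""
    addrs = [a for a, _ in syms]
    i = bisect_right(addrs, eip) - 1
    if i < 0:
        return None
    addr, name = syms[i]
    return name, eip - addr

print(locate(0xf818ecee, server_syms))  # -> ('_spp_ctlinput', 78)
print(locate(0xf812982b, client_syms))  # -> ('_bdwrite', 27)
```

So the server appears to die in _spp_ctlinput+0x4e and the client in
_bdwrite+0x1b, assuming the symbol lists above are complete around those
addresses.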

The message is something along the lines of
"Fatal page fault in supervisor mode...
   vm_fault(f87f7700,0,1,0) -> 1 " (on the server)

I am not sure how relevant the dump is, but...
examining the dump with gdb gives the following:

GDB 4.11 (i386-netbsd), Copyright 1993 Free Software Foundation, Inc...
panic: trap
#0  mi_switch () at ../../../../kern/kern_synch.c:612
612             microtime(&runtime);
(kgdb) where
#0  mi_switch () at ../../../../kern/kern_synch.c:612
#1  0xf8117629 in tsleep (ident=0x0, priority=4, wmesg=0xf81d8016 "scheduler", 
    timo=0) at ../../../../kern/kern_synch.c:358
#2  0xf81d8080 in scheduler () at ../../../../vm/vm_glue.c:396
#3  0xf810e1c6 in main (framep=0x0) at ../../../../kern/init_main.c:369
(kgdb) list
607             /*
608              * Pick a new current process and record its start time.
609              */
610             cnt.v_swtch++;
611             cpu_switch(p);
612             microtime(&runtime);
613     }
615     /*
616      * Initialize the (doubly-linked) run queues
(kgdb) print &runtime
$1 = (struct timeval *) 0xf823dea4
(kgdb) print *runtime
Attempt to take contents of a non-pointer value.
(kgdb) print runtime
$2 = {tv_sec = 828230639, tv_usec = 640642}
(kgdb) up
#1  0xf8117629 in tsleep (ident=0x0, priority=4, wmesg=0xf81d8016 "scheduler", 
    timo=0) at ../../../../kern/kern_synch.c:358
358             mi_switch();
(kgdb) list
353                     }
354             } else
355                     sig = 0;
356             p->p_stat = SSLEEP;
357             p->p_stats->p_ru.ru_nvcsw++;
358             mi_switch();
359     #ifdef  DDB
360             /* handy breakpoint location after process "wakes" */
361             asm(".globl bpendtsleep ; bpendtsleep:");
362     #endif
(kgdb) print panicstr
$3 = 0xf81e6800 "trap"
(kgdb) up
#2  0xf81d8080 in scheduler () at ../../../../vm/vm_glue.c:396
396                     tsleep((caddr_t)&proc0, PVM, "scheduler", 0);
(kgdb) down
#1  0xf8117629 in tsleep (ident=0x0, priority=4, wmesg=0xf81d8016 "scheduler", 
    timo=0) at ../../../../kern/kern_synch.c:358
358             mi_switch();
(kgdb) down
#0  mi_switch () at ../../../../kern/kern_synch.c:612
612             microtime(&runtime);
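
As a sanity check on the dump itself, the runtime value printed above decodes
to a wall-clock time on the date of this report, so the dump at least isn't
stale (a sketch; Python is used purely for the conversion):

```python
# Decode the struct timeval that kgdb printed for `runtime`:
#   $2 = {tv_sec = 828230639, tv_usec = 640642}
# tv_sec counts seconds since the Unix epoch.
from datetime import datetime, timezone

crash = datetime.fromtimestamp(828230639, tz=timezone.utc)
print(crash.isoformat())  # -> 1996-03-31T00:03:59+00:00
```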


The clients have / and /var locally; /usr is NFS-mounted from the server.
Also lots of amd stuff and other odds and ends. The clients have 1 GB Seagate
(E)IDE drives; the server has two Adaptec 7870 controllers (configured to
use only one) and routes packets to and from the main CS network.

The network cards in the machines are 3Com 3C595 10/100 Mbps cards.
The outside card on the server is run at 10 Mbps, but the inside card
(i.e., the one on the same network as the clients) runs at the
subnet speed of 100 Mbps. We have two Bay Networks 100BaseTX hubs
to connect everything together.

Is it possible that the problem is due to a (previously undetected) race
condition in the device driver that becomes worse because of the higher
network speed? The number of hosts that go down in any one
incident is really variable: from one to all five (and anything in between).
The server (a P133) seems a little more resistant to the problem than
the clients (P120's), but it crashes too, as you can see above.
Note: the problem does not seem to happen more frequently when there is a
lot of traffic! It *very* frequently happens when the clients attempt to
mount /usr from the server during the bootup sequence.

I haven't been able to catch a packet with tcpdump.
Just now I had a machine crash, try to boot, crash, try to boot. I started
tcpdump on another machine (apparently unaffected) to see what was going on 
on the network, and the machine that was crashing booted just fine. Then 
I Ctrl-C'd tcpdump, and the machine I was trying to monitor crashed
immediately (I also power-cycled a third machine...?), and then the 
one running tcpdump crashed too. I have tried very hard to catch the
culprit (if it is indeed a network packet) with tcpdump, but either the
machine running tcpdump crashes too, or else the network "coincidentally"
goes down almost immediately after I Ctrl-C tcpdump.
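
One trick that might help catch the culprit: have tcpdump write raw packets
to a file instead of decoding them live, so the capture survives (up to the
last block flushed to disk) even if the capturing machine crashes. A sketch;
the interface name ep0 is my assumption for a 3C595 under NetBSD, so
substitute whatever netstat -i reports:

```shell
# Dump raw packets to a file (-w) instead of printing them; the capture
# can be replayed later with -r on any machine.  Interface name and file
# path are assumptions.
tcpdump -i ep0 -w /var/tmp/crash.pcap

# After the next crash, from a surviving machine:
tcpdump -r /var/tmp/crash.pcap
```

That way nothing has to be Ctrl-C'd at the interesting moment, which also
sidesteps the odd correlation between killing tcpdump and the crashes.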

Does anyone have an idea what's going on here? I'm supposed to have a lab
of 24 machines up and running smoothly by this coming week :)