Subject: Re: Lockup under heavy network use
To: None <tech-net@netbsd.org>
From: Christos Zoulas <christos@astron.com>
List: tech-net
Date: 09/09/2005 11:31:35
In article <Pine.NEB.4.63.0509081804580.468@lain.ziaspace.com>,
John Klos  <john@ziaspace.com> wrote:
>Hello,
>
>I'm seeing some interesting lockup problems on two different machines. One 
>is a 200 MHz PowerPC 603e system, the other a 250 MHz Cobalt Raq2. Both 
>are serving around 20 to 30 Mbps of web traffic, which is about as much as 
>they can serve. I didn't want faster systems because I didn't want to use 
>much more bandwidth than that (and altq is not exactly production ready 
>yet). However, both of them have locked up under heavy network use. The 
>symptoms are the same: they still respond to ICMP on both IPv4 and IPv6, 
>but don't actually answer requests. Unfortunately, both are colocated, and 
>neither has a serial terminal or console (yet).
>
>The only thing which resembles a clue otherwise is seeing this on a root 
>shell on the Cobalt right before the last lockup:
>
>free(100676a8) bad block. (memtop = 100b3800 membot = 10058550)
>free(10067688) bad block. (memtop = 100b3800 membot = 10058550)
>free(10067668) bad block. (memtop = 100b3800 membot = 10058550)
>free(10068608) bad block. (memtop = 100b3800 membot = 10058550)
>free(10068c08) bad block. (memtop = 100b3800 membot = 10058550)
>free(10067648) bad block. (memtop = 100b3800 membot = 10058550)

This looks like a tcsh error message, printed by its built-in allocator
when free() is called on an invalid memory block. Change tc.alloc.c so
that it has #define RCHECK and #define DEBUG (it currently #undefs
both), and recompile. You will then get an abort() when it happens,
which should point out what is wrong.
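
To illustrate why a debug build helps, here is a minimal sketch of the
general pattern. It is not the tcsh source: MAGIC, union header,
tagged_alloc(), tagged_free() and this CHECK macro are all hypothetical
names made up for the example. The idea is that each block carries a tag
in a small header, free() verifies the tag, and the DEBUG switch decides
whether a mismatch is only reported or stops the process on the spot.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAGIC 0xfdu /* hypothetical tag value, chosen for this sketch */

/* Header stored in front of every block; padded for alignment. */
union header {
    unsigned char magic;
    long double   align_ld;
    long long     align_ll;
    void         *align_p;
};

#ifdef DEBUG
/* Debug build: report and abort, so a debugger or core file shows the caller. */
# define CHECK(cond, msg, p) \
    do { if (cond) { fprintf(stderr, msg, (void *)(p)); abort(); } } while (0)
#else
/* Normal build: report, refuse the request, and carry on. */
# define CHECK(cond, msg, p) \
    do { if (cond) { fprintf(stderr, msg, (void *)(p)); return 0; } } while (0)
#endif

/* Allocate n bytes with a tagged header in front of the returned pointer. */
static void *tagged_alloc(size_t n)
{
    union header *h = malloc(sizeof(*h) + n);

    if (h == NULL)
        return NULL;
    h->magic = MAGIC;
    return h + 1;
}

/* Returns 1 if the block was released, 0 if it was rejected. */
static int tagged_free(void *p)
{
    union header *h;

    if (p == NULL)
        return 1;
    h = (union header *)p - 1;
    CHECK(h->magic != MAGIC, "free(%p) bad block.\n", p);
    free(h);
    return 1;
}
```

Built without -DDEBUG, a bad pointer only produces the message and the
program keeps running, which matches what you saw on the root shell.
Built with -DDEBUG, the first bad free() calls abort(), and the stack at
that point shows which caller passed the bad pointer.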

christos