tech-net: stressing the network ...

Subject: stressing the network ...
To: None <tech-net@netbsd.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-net
Date: 03/05/2000 13:45:07
I have a 1.4T system (the 20000213 i386 snapshot) whose network I have been
stressing for the past day or so ...   This is what netstat -m tells me at the
minute:

marinara# uptime
 1:03PM  up 23:11, 3 users, load averages: 1.06, 1.08, 1.08
marinara# netstat -m
112 mbufs in use:
        103 mbufs allocated to data
        3 mbufs allocated to packet headers
        6 mbufs allocated to socket names and addresses
4294258022/256 mapped pages in use
636 Kbytes allocated to network (1181% in use)
0 requests for memory denied
0 requests for memory delayed
916946 calls to protocol drain routines
marinara# uname -a
NetBSD marinara.cs.mu.OZ.AU 1.4T NetBSD 1.4T (GATEWAY) #0: Thu Feb 24 22:29:54 EST 2000     root@marinara.cs.mu.OZ.AU:/usr/src/sys/arch/i386/compile/GATEWAY i386

To me, something looks just a little broken there...
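
One possible reading of the "mapped pages in use" figure (a guess, not a
diagnosis): 4294258022 is just under 2^32, which is what a negative count
looks like when it is printed as an unsigned 32-bit number.  The sketch below
only does that reinterpretation; the 32-bit width of the counter and the
helper name as_signed32 are assumptions, not something taken from the kernel
source.

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return value - (1 << 32) if value & 0x80000000 else value

reported = 4294258022   # "mapped pages in use" from the netstat -m output
print(as_signed32(reported))   # -709274
```

If that reading is right, the count of clusters in use has gone well below
zero rather than far above the 256 limit.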

First, this system is configured to be just a normal workstation, no hacks at 
all done to its kernel config to make it more suited as a network server.  The
kernel has complained several times while I have been beating on it about
how I should increase NMBCLUSTERS.  As in ...

WARNING: mclpool limit reached; increase NMBCLUSTERS

and "several times" might be a bit of an underestimate (the entire dmesg
log buffer is full of that message, I have no idea how many more were rotated
out).

The way I got it into this state was to set it up as an ftp server, and give 
it an (almost) 2GB file to serve (the slightly less than 2GB size was just the
natural size of a file I wanted to distribute - it wasn't set that way because 
of any desire to keep the size under 0x80000000 or anything like that).

This system (marinara) has a 3C905C-TX connected to a dumb 10Mbit
unswitched hub (which is then hung off a 10Mbit switched port of a switch
that is connected to a router via a 100Mbit link).   I set up 18 or 20 or
so identical machines, but with each of those on a switched 100Mbit
port on a different switch connected to another 100Mbit port on the router.
Then I started all of those clients fetching the 2GB file in parallel.

Or rather, I started as many as marinara (the server) could take before deciding
that its mclpool had run out, and no more connections could be accepted (which
was about 15 I think).   Those 15 clients started fetching at an average
sustained rate of around 60KB/sec (which is reasonable, that's 900KB/sec
over the 10Mbit/sec ethernet - plus overheads etc).   They all kept that
up for something between 5 and 6 hours, then stalled (that's approx half
way through the transfer).   The client and server ftp processes were still
there and alive, but no more data was making it through any of the connections.
The connections all killed cleanly (kill the client, the server exited).
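
As a rough check on those figures, here is the arithmetic spelled out.  It
assumes 1KB = 1024 bytes and 1Mbit = 10^6 bits, and treats the file as a full
2GB (it is slightly smaller); the function names are just for illustration.

```python
CLIENTS = 15                 # "about 15" concurrent transfers
PER_CLIENT_KB_S = 60         # average sustained rate per client
LINK_MBIT_S = 10             # shared 10Mbit ethernet segment
FILE_BYTES = 2 * 1024**3     # upper bound: the file is slightly under 2GB

def aggregate_kb_per_s(clients, per_client_kb_s):
    """Total payload rate leaving the server, in KB/sec."""
    return clients * per_client_kb_s

def link_utilisation(aggregate_kb_s, link_mbit_s):
    """Fraction of the raw link rate used by payload alone."""
    return aggregate_kb_s * 1024 * 8 / (link_mbit_s * 10**6)

def hours_to_fetch(file_bytes, kb_per_s):
    """Time for one client to fetch the whole file at a steady rate."""
    return file_bytes / (kb_per_s * 1024) / 3600

total = aggregate_kb_per_s(CLIENTS, PER_CLIENT_KB_S)
print(total)                                                  # 900
print(round(link_utilisation(total, LINK_MBIT_S), 2))         # 0.74
print(round(hours_to_fetch(FILE_BYTES, PER_CLIENT_KB_S), 1))  # 9.7
```

So the payload alone is roughly three quarters of the raw 10Mbit rate, and a
full transfer at 60KB/sec would take a bit under 10 hours, which fits with a
stall after 5 to 6 hours being about half way through.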

I restarted the transfers (ftp -R - this wasn't purely an exercise, I really
wanted the data transferred) in groups of 4 clients at a time.   The 15 or so
that had been running all finished OK that way (the others I foolishly turned
off and so can't get at over the net ...).

netstat -in (on marinara) shows...

marinara# netstat -in
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
ex0   1500  <Link>        00:50:da:61:0d:7a 15388098     1 22809894     0 4605687
ex0   1500  128.250.26.12 128.250.26.152    15388098     1 22809894     0 4605687
ex0   1500  fe80::/64     fe80::250:daff:fe 15388098     1 22809894     0 4605687
lo0   32976 <Link>                               116     0      116     0     0
lo0   32976 fe80::/64     fe80::1                116     0      116     0     0
lo0   32976 ::1/128       ::1                    116     0      116     0     0
lo0   32976 127           127.0.0.1              116     0      116     0     0
gif0* 1280  <Link>                                 0     0        0     0     0

(none of these are yet connected to our IPv6 net, so they have only
link local addresses - but that's irrelevant).

If anyone would like to look at this, and take any guesses as to what may have
been going on to cause that screwy netstat -m output, I may at some time have
a reason to do all of this all over again.

For what it's worth, this was part of the installation process for a new lab of
systems ... being at a university has the (dis)advantage that from time to time
a whole set of systems turn up which all need to be installed and made useable
today...

The NetBSD install onto these takes 6 minutes (each - though they can do
as many in parallel as you can make CD copies - and have people around to
rescue the CD when it is done before the absurd BIOS boot method sucks the
CD back in and repeats the entire process...  playing with the CDs is the
only human interaction)

That's using a custom boot CD that boots, does the install (including local 
config) and reboots (here the CD needs to be removed) to NetBSD where
the net config is done using DHCP.   The 6 minutes is from the time where
the BIOS clears the screen as it starts its self tests until when the NetBSD
login prompt occurs - and is a full install, plus a few essential binary 
packages (from the CD - made from pkgsrc of a week or so ago).

The 2GB file is the windows installation ... (these are to be dual boot, sadly).
I haven't yet found a better way of automating that, so it gets done once, then dd'd
and gzip'd, and distributed over the net, then ungzip'd and dd'd onto the windows
partition made during the NetBSD install.   If anyone happens to know of a "image
copy this windows filesystem" tool (one which skips all the unused blocks, which
tend to be full of garbage, so gzip doesn't make them reduce to nothing) I'd appreciate 
learning about it (of course, it needs to run under NetBSD - or be able to be
made runnable there, no windows applications need apply..).  A few registry
tweaks are needed on the windows systems to make them useable (to make
them unique) but that's all that should be needed.
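
For reference, the dd-and-gzip round trip described above could be sketched
in Python as below.  This only illustrates the current approach, it is not the
block-skipping tool being asked for; the function names and any paths passed
to them are hypothetical.

```python
import gzip
import shutil

CHUNK = 1024 * 1024  # copy in 1MB pieces so a 2GB image never sits in memory

def save_image(src_path, gz_path):
    """Read a raw partition (or file) and write a gzip-compressed copy."""
    with open(src_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK)

def restore_image(gz_path, dst_path):
    """Decompress a saved image back onto a partition (or file)."""
    with gzip.open(gz_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK)
```

Like the dd version, this copies every block, so unused blocks full of old
data are compressed and shipped along with everything else.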

kre

ps: don't reply and tell me that I should have connected the server to a 
100Mbit port - of course I should - but there's a difference between what
should be done, and what is actually going to get done in time for the installs
to be finished...   And yes, I could also have copied the file once to 
one of the 100Mbit systems, then made it be a server, but at the time I really
just wanted to start them all going and go home - how many hours the
copy took wasn't that important (I expected them to complete in about
12 hours, much of which was when I was going to be asleep).