Subject: "Too many open files" hang in 1.3 and 1.3.3
To: 'port-i386@netbsd.org' <port-i386@netbsd.org>
From: Gunnar Helliesen <gunnar@bitcon.no>
List: port-i386
Date: 02/21/1999 07:56:37
Executive summary ;-):

Problem: The kernel prints "Too many open files" _rapidly_ on the
console and in /var/log/messages. I'm unable to log in, but the system
responds to ping.

System: NetBSD/i386 1.3 and 1.3.3

This has happened before. How do I fix it?

More detailed report:

Right now I have a server 500 km away that I can't log into. It responds
to pings and it even responds to HTTP requests (it delivers Web pages)
but trying to connect with FTP just times out and trying to connect with
telnet just leaves a blank screen (no banner and no 'login:' prompt).

Last time this happened was 4 days ago on NetBSD 1.3. Since then I've
upgraded the kernel to 1.3.3 (userland is still 1.3). The hardware is a
Pentium Pro with 128 MB of physical RAM and 1 GB of swap. I've rolled my
own kernel, commenting out DDB (because I want the thing to auto reboot)
and all devices not physically present. As I upgraded to 1.3.3 I also
bumped maxusers from 64 to 128, but this clearly didn't help. I can't
supply a dmesg output from the system right now for obvious reasons...

Last time this happened I got someone at the other end to help me reboot
the system. When it came back up I found out that what had happened was
that the kernel had started spewing out "Too many open files" messages
and quickly filled up the / partition (yes, /var is on /, I clearly have
to change this). Incidentally (or perhaps not) the system had been up
for 4 days then as well.

Anywho, I need to get this fixed. I've read through the options manpage
and I've even tried to RTFS, but I can't find where the max number of
open files is set. Of course I don't know for sure that it's the same
problem this time, but the symptoms are identical.
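
For reference, the message text itself is a clue: "Too many open files"
is the error string for EMFILE (a per-process descriptor limit), whereas
a full system-wide file table is ENFILE, "Too many open files in
system". Below is a minimal sketch of how the per-process limit and a
process's current usage could be inspected. It uses Python's standard
resource module purely for illustration (an assumption; nothing in this
report involves Python), and the helper names open_file_limits,
describe and count_open_fds are hypothetical.

```python
import errno
import os
import resource

# Arbitrary cap so a huge or unlimited soft limit does not slow the scan.
PROBE_CAP = 4096


def open_file_limits():
    """Return the (soft, hard) per-process descriptor limits (RLIMIT_NOFILE)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to a readable word."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def count_open_fds(max_fd=PROBE_CAP):
    """Count descriptors open in this process by probing 0..max_fd-1."""
    count = 0
    for fd in range(max_fd):
        try:
            os.fstat(fd)  # raises OSError (EBADF) if fd is not open
        except OSError:
            continue
        count += 1
    return count


if __name__ == "__main__":
    soft, hard = open_file_limits()
    probe = PROBE_CAP if soft == resource.RLIM_INFINITY else min(soft, PROBE_CAP)
    print("soft limit:", describe(soft))
    print("hard limit:", describe(hard))
    print("open now  :", count_open_fds(probe))
    # EMFILE = this process hit its own limit; ENFILE = system table full.
    print("EMFILE:", errno.EMFILE, "ENFILE:", errno.ENFILE)
```

On BSD systems the corresponding knobs are usually "ulimit -n" (or
"limit descriptors" in csh) per process and the kern.maxfiles sysctl
system-wide; check the man pages for the release in question before
relying on either.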

In the FWIW dept., this server is used as a busy Web and FTP server.
"Idle" load is ~20 httpd processes and ~30 ftpd processes. Busy load is
~100 httpd processes and ~150 ftpd processes. Extreme loads can go way
beyond that, but I can never tell exactly how much because it always
crashes. On average it pumps out 1.3 Mb/s of data 24 hours a day, 7 days
a week, with peaks up to 8 Mb/s (10 Mb/s Ethernet on a dedicated switch
port). At least that's what the switch reports according to MRTG. I
suspect the averages could be a lot higher if the damn thing would stay
up. A crash always leads to hours of downtime because I'm dependent on
others 500 km away to reboot it. I guess this time it will stay "down"
until Monday morning. Sigh...

Now what? Help urgently needed!

BTW, upgrading to -current is not an option, because:
1. This system is (supposed to be) in production.
2. I have to travel to Oslo to do the upgrade and that's expensive. I
will upgrade to 1.4 when the war stories are in.

BTW2, I'd love to hear from others running busy NetBSD/i386 servers
remotely. Have you found a way of "redirecting" the console far away,
over the Internet? I'm thinking of a terminal server connected to a
serial port on the machine, plus a kernel compiled with serial console
support. I could then telnet into the terminal server. Security issues
aside, would this give full control over the machine for upgrades etc.?
That still doesn't solve those situations where you need a hard reset.
Any suggestions?

Gunnar

--
Gunnar Helliesen   | Bergen IT Consult AS  | NetBSD/VAX on a uVAX II
Systems Consultant | Bergen, Norway        | '86 Jaguar Sovereign 4.2
gunnar@bitcon.no   | http://www.bitcon.no/ | '73 Mercedes 280 (240D)