NetBSD-Bugs archive


kern/48586: Kernel complains that the proc table is full even when it is not

>Number:         48586
>Category:       kern
>Synopsis:       The kernel complains that the proc table is full even when it is not
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 10 16:05:00 +0000 2014
>Originator:     Tero Kivinen
>Release:        NetBSD 6.1_STABLE
>Environment:
System: NetBSD 8 12:56:43 EET 2014
Architecture: x86_64
Machine: amd64

The HASTE kernel is GENERIC with a larger MAXDSIZ (32 GB):

        include "arch/amd64/conf/GENERIC"
        options MAXDSIZ=34359738368

The problem occurred with the GENERIC kernel too, so that change should
not be the cause.


>Description:
        I am creating Garmin maps using the java tools (osmosis,
        splitter, mkgmap). The java tools use sun-jre7-7.0.45 from
        pkgsrc. After 15 or so hours of continuously running scripts,
        the scripts start to complain:

        Map: -110.0..-150.0 -30..0 1300 sa-w-c-ele.img South America West Central
        /m/smbkivinen/garmin/bin/ Cannot fork
        /m/smbkivinen/garmin/bin/ Cannot fork
        /m/smbkivinen/garmin/bin/ Cannot fork
        /m/smbkivinen/garmin/bin/ Cannot fork

        and when I check the syslog there are messages saying:

        haste (17:32) /m/smbkivinen/garmin>tail /var/log/messages
        Feb 10 17:24:37 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
        Feb 10 17:25:17 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
        Feb 10 17:27:19 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC

        Then when I check how many processes are running, ps claims:

        haste (17:41) /m/smbkivinen/garmin>ps agxu | wc
              52     581    4056

        The system is configured with kern.maxproc of 8000, but the
        kernel is complaining that the proc table is full even when
        only 52 processes are running:

        haste (17:42) /m/smbkivinen/garmin>sysctl -a | fgrep maxproc
        kern.maxproc = 8000
        proc.curproc.rlimit.maxproc.soft = 160
        proc.curproc.rlimit.maxproc.hard = 1044
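
        The rlimit values above are the per-process limits that fork
        checks in addition to the global table. A minimal sketch
        (illustration only, not output from this system) that prints
        the same limits from inside a process:

        /* sketch: print the per-process limit fork() is checked
         * against; RLIMIT_NPROC is standard on NetBSD */
        #include <stdio.h>
        #include <sys/resource.h>

        int
        main(void)
        {
                struct rlimit rl;

                if (getrlimit(RLIMIT_NPROC, &rl) == -1) {
                        perror("getrlimit");
                        return 1;
                }
                printf("maxproc soft=%lld hard=%lld\n",
                    (long long)rl.rlim_cur, (long long)rl.rlim_max);
                return 0;
        }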

        I can still run a few processes, but if I try to run more than
        a few, the forks fail:

        haste (17:42) /m/smbkivinen/garmin>(sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 &)
        zsh: fork failed: resource temporarily unavailable
        haste (17:42) /m/smbkivinen/garmin>ps agxu | wc
              52     581    4056
        haste (17:42) /m/smbkivinen/garmin>
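
        To measure how much fork headroom is actually left at this
        point, a small test program along these lines could count
        successful forks until the first failure (a sketch I have not
        run, written to illustrate the test):

        /* sketch: fork children that just sleep, and count how many
         * proc slots are left before fork() starts failing */
        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int
        main(void)
        {
                int n = 0;
                pid_t pid;

                for (;;) {
                        pid = fork();
                        if (pid == -1) {
                                printf("fork #%d failed: %s\n", n + 1,
                                    strerror(errno));
                                break;
                        }
                        if (pid == 0) {
                                sleep(30);      /* hold the slot */
                                _exit(0);
                        }
                        if (++n >= 1000)        /* safety cap */
                                break;
                }
                printf("%d forks succeeded\n", n);
                while (wait(NULL) > 0)          /* reap the children */
                        continue;
                return 0;
        }
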
        There does not seem to be any other way to recover from this
        situation than a reboot. My guess is that there is something
        wrong with the Linux emulation in the kernel which leaks
        processes in the proc table or something. During the 16 hours
        since the last reboot, I have run osmosis (the java program)
        around 6500 times, and java has crashed around 200 times. The
        Linux emulation java seems to crash randomly quite often
        (usually with an out of memory error or similar), and rerunning
        the program usually works after a few tries. Those java crashes
        might be related to the fact that I am using quite large java
        memory limits: osmosis uses 4 GB, mkgmap uses 11 GB and
        splitter uses 18 GB. The splitter was not able to process my
        maps with the default max data size limit (8 GB), so that is
        why I had to compile a special kernel with a larger MAXDSIZ.

        Looking at my kern.maxproc (8000) and the number of times I
        have run those Linux emulation java programs, it might be that
        every single Linux emulation java run leaks one kernel proc
        table entry.

>How-To-Repeat:
        Try running osmosis using sun-jre7-7.0.45 in a loop and see if
        that uses up the proc table. I have not tried this, but with my
        current workload the problem repeats daily, so it is quite fast
        for me to test fixes. If you set kern.maxproc to a much lower
        value, this will most likely repeat much faster.
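
        A tighter repro might be a loop that just forks and execs the
        Linux-emulation java binary many times; the install path below
        is an assumption about where pkgsrc puts sun-jre7, so adjust it
        as needed:

        /* sketch: run the emulated java repeatedly and see whether
         * fork headroom shrinks by roughly one slot per run */
        #include <stdio.h>
        #include <sys/wait.h>
        #include <unistd.h>

        /* assumed pkgsrc install path, adjust for the local setup */
        #define JAVA "/usr/pkg/java/sun-7/bin/java"

        int
        main(void)
        {
                for (int i = 0; i < 1000; i++) {
                        pid_t pid = fork();
                        if (pid == -1) {
                                perror("fork");
                                return 1;
                        }
                        if (pid == 0) {
                                execl(JAVA, "java", "-version",
                                    (char *)NULL);
                                _exit(127);     /* exec failed */
                        }
                        waitpid(pid, NULL, 0);
                }
                return 0;
        }

        Comparing vmstat -m pool statistics before and after such a
        loop might also show whether some proc-related kernel pool
        grows without shrinking back.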
>Fix:
        Not known. Raising kern.maxproc to a much bigger value (65k or
        so) would most likely just move the limit further away.
