Subject: exceeded mb_map sensibility?
To: None <current-users@NetBSD.ORG>
From: C. Ewen MacMillan <ilixi@tezcat.net>
List: current-users
Date: 09/17/1996 23:55:54
 Hello All,

 First of all, the identifying details. As you will see, this is not
 in fact current (rather, it is based on current circa August 19):

tepe#> uname -a
NetBSD tepe.tezcat.com 1.2_BETA NetBSD 1.2_BETA (tepe-new) #1: 
Mon Aug 26 12:40:10 CDT 1996     
ilixi@tepe.tezcat.com:/usr/src/sys/arch/i386/compile/tepe-new i386
(the above has been reformatted for human-oriented mail-readers).

 Hardware: ASUS P90,  ASUS PCI200 SCSI Controller, 64MB.

 Dmesg output:
 NetBSD 1.2_BETA (tepe-new) #1: Mon Aug 26 12:40:10 CDT 1996
    ilixi@tepe.tezcat.com:/usr/src/sys/arch/i386/compile/tepe-new
CPU: Pentium (GenuineIntel 586-class CPU)
real mem  = 66715648
avail mem = 61054976
using 840 buffers containing 3440640 bytes of memory
mainbus0 (root)
isa0 at mainbus0
ep0 at isa0 port 0x300-0x30f irq 10: 3Com 3C509 Ethernet
ep0: aui/utp address 00:20:af:18:d6:67
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
npx0 at isa0 port 0xf0-0xff: using exception 16
pc0 at isa0 port 0x60-0x6f irq 1: color
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
pci0 at mainbus0 bus 0: configuration mode 1
vendor 0x8086 product 0x122d (class bridge, subclass host, revision 0x02) at pci0 dev 0 function 0 not configured
vendor 0x8086 product 0x122e (class bridge, subclass ISA, revision 0x02) at pci0 dev 7 function 0 not configured
vendor 0x10a8 product 0x0000 (class display, subclass VGA, revision 0x00) at pci0 dev 10 function 0 not configured
ncr0 at pci0 dev 12 function 0: NCR 53c810 SCSI
ncr0: interrupting at irq 11
ncr0: restart (scsi reset).
scsibus0 at ncr0
sd0 at scsibus0 targ 0 lun 0: <SEAGATE, ST51080N, 0943> SCSI2 0/direct fixed
sd0: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
1030MB, 4826 cyl, 4 head, 109 sec, 512 bytes/sec
sd1 at scsibus0 targ 1 lun 0: <SEAGATE, ST15230N, 0638> SCSI2 0/direct fixed
sd1: sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
4095MB, 3992 cyl, 19 head, 110 sec, 512 bytes/sec
sd2 at scsibus0 targ 2 lun 0: <SEAGATE, ST15230N, 0638> SCSI2 0/direct fixed
sd2: sd2(ncr0:2:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
4095MB, 3992 cyl, 19 head, 110 sec, 512 bytes/sec
biomask 840 netmask c40 ttymask c52
changing root device to sd0a

 And the relevant parts of the kernel config (the rest of the configuration
 simply reflects deletions from GENERIC, for devices we do not have):
machine         i386            # architecture, used by config; REQUIRED

options         I586_CPU

# Some BIOSes don't get the size of extended memory right.  If you
# have a broken BIOS, uncomment the following and set the value
# properly for your system.
#options        EXTMEM_SIZE=... # size of extended memory

options DUMMY_NOPS      # speed hack; recommended
options         NMBCLUSTERS=512
options         UCONSOLE
#options                SCSIDEBUG
#options                SCSI_DEBUG_FLAGS
#options                NCR_IOMAPPED
options         INSECURE        # insecure; allow /dev/mem writing for X
options         MACHINE_NONCONTIG

maxusers        64              # estimated number of users
options         TIMEZONE=0      # time zone to adjust RTC time by
options         DST=0           # daylight savings time used by RTC

options         SWAPPAGER       # paging; REQUIRED
options         VNODEPAGER      # mmap() of files

 Around the end of June, I think it was, I first started running
 into mb_map full errors. Chris D. (I think) suggested increasing
 NMBCLUSTERS to 512, which worked until:

Sep 16 00:21:25 tepe /netbsd: mb_map full
Sep 16 00:30:12 tepe /netbsd: mb_map full
Sep 16 01:28:53 tepe /netbsd: mb_map full
Sep 16 03:36:48 tepe load: 15 minute load average = 5.05
Sep 16 03:39:02 tepe load: 15 minute load average = 5.18

 We have been getting the error periodically for about the last
 two weeks. For background that might make some sense of all
 of this: for some reason we climbed from roughly #350 to #115 in the
 Freenix ratings (i.e. Top 1000 Usenet sites).

 Obviously, we have experienced a commensurate increase
 in network traffic. Unfortunately, increasing NMBCLUSTERS does
 not do what one might think it would: it seems to panic the machine
 (completely wiping the dmesg buffer for some reason!). I have tried
 both 768 and 1024 as values, and the problem is decidedly worse
 at 1024. Due to the lack of a 1.1-oriented compatibility interface for
 if_name/if_unit, I cannot, it seems, even use SNMP to generate a trap
 at the time the problem occurs. Since this is a production machine, I
 cannot leave it set to drop to the debugger either, though my intuition
 tells me that whatever kills it, kills it HARD.
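
 For scale, a rough sketch of how much kernel virtual space each of the
 NMBCLUSTERS values mentioned above would reserve for mb_map. It assumes
 2048-byte mbuf clusters (MCLBYTES) and that mb_map is sized as
 NMBCLUSTERS * MCLBYTES; neither assumption was checked against this
 kernel's sources, and the helper name is purely illustrative.

```python
# Assumption: mbuf clusters are 2048 bytes (MCLBYTES) on this i386 kernel,
# and mb_map is sized as NMBCLUSTERS * MCLBYTES. Neither was verified here.
MCLBYTES = 2048


def mb_map_bytes(nmbclusters, mclbytes=MCLBYTES):
    """Return the bytes of kernel virtual space an mb_map of this size covers."""
    return nmbclusters * mclbytes


# The three values tried on tepe: the working 512, then 768 and 1024.
for n in (512, 768, 1024):
    size = mb_map_bytes(n)
    print(f"NMBCLUSTERS={n}: {size} bytes ({size / (1024 * 1024):.2f} MB)")
```

 Under those assumptions even 1024 clusters is only about 2 MB of map
 on a 64MB machine, so the raw size alone would not obviously explain
 a panic.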

 I realize that what I am providing is subjective. Unfortunately, I
 cannot provide you with dmesg output from the increased buffer sizes,
 as I cannot produce it. I do recall an experiment with a P75 (we dropped
 it in to test the hardware) generating a weird pmap error in the same
 situation.

 Lastly, performance degrades as the mb_map full errors increase.
 Based on top, it seems that whenever the threshold is reached the
 machine is effectively non-responsive for a period of at least
 45 seconds. We have dropped to about 25% of the performance we had
 about one week ago (based on userland NNTP statistics, the only thing
 relevant for this machine), with only a 2-3% total increase in
 network traffic to the machine.

 I have tried the PR 1903 bufcache patch under a slightly older revision
 but it panicked on me a great deal, and would not boot above 7% bufcache.
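
 As a rough cross-check of where the buffer cache sits now, the dmesg
 figures above (840 buffers, 3440640 bytes, real mem 66715648) can be
 turned into a per-buffer size and a fraction of real memory. The sketch
 below only does that arithmetic; the variable names are illustrative,
 and it assumes the percentage is taken against real memory, which may
 not be how the PR 1903 patch computes it.

```python
# Figures copied from the dmesg output earlier in this message.
REAL_MEM_BYTES = 66715648
NBUF = 840
BUFCACHE_BYTES = 3440640

# Average size of one buffer, in bytes.
bytes_per_buffer = BUFCACHE_BYTES // NBUF

# Share of real memory currently given to the buffer cache.
# Assumption: "percent bufcache" is measured against real memory.
bufcache_fraction = BUFCACHE_BYTES / REAL_MEM_BYTES

print(f"bytes per buffer: {bytes_per_buffer}")
print(f"bufcache share of real mem: {bufcache_fraction * 100:.2f}%")
```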

 Any suggestions, aside from the obvious (i.e. go back to 1.1, or to some
 other, more i386-friendly OS)?

 Regards,

 CEM


 

 
--
C. Ewen MacMillan

E-mail: ilixi@tezcat.net
Phone:  (312)850-0181