Subject: possible kernel bug: stat mclpl after heavy net load
To: None <port-sparc64@NetBSD.org>
From: None <netbsd@wiki.uni-konstanz.de>
List: port-sparc64
Date: 12/14/2005 22:55:07
Hi all,

I am not sure, but to me this looks like a kernel bug:

I have a Sun Netra T1 running netbsd-2-0 sparc64 from 8 Dec 05.
The machine is a mirror and serves several DVD images (Wikipedia) via
HTTP. The problem is that the httpd freezes after several hundred
gigabytes of network traffic (the network interface serves between
6 and 10 MB/s constantly).

At first I used apache2. After around 700G of traffic, the apache
processes sit in the state mclpl. It is not possible to kill the httpd
with -9.

After that I tried thttpd, but the problem is nearly the same: after
around 700G of traffic thttpd freezes and I am not able to kill
the daemon with -9. In both cases all other processes seem to run
normally (sshd, tcsh, top, lsof etc.), but I am forced to reboot the
machine to get rid of the frozen daemons.

Since then I have kept using thttpd, but I have to reboot the server
more or less once a day, whenever the httpd freezes again.
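
As a rough plausibility check on these figures: at 6 to 10 MB/s, about
700G of traffic takes somewhere between 19 and 33 hours, which fits a
freeze roughly once a day. A minimal sketch of that arithmetic, assuming
"700G" means 700 * 10^9 bytes and 1 MB means 10^6 bytes (both are
assumptions, the report does not say):

```python
# Rough time-to-freeze estimate from the figures given in the report.
TRAFFIC_BYTES = 700e9        # "around 700G" (assumed: decimal gigabytes)
RATES_MB_PER_S = (6, 10)     # reported sustained throughput range


def hours_until(traffic_bytes, rate_mb_per_s):
    """Hours needed to transfer traffic_bytes at rate_mb_per_s (10^6 bytes/s)."""
    return traffic_bytes / (rate_mb_per_s * 1e6) / 3600


estimates = {rate: hours_until(TRAFFIC_BYTES, rate) for rate in RATES_MB_PER_S}
for rate, hours in estimates.items():
    print(f"{rate} MB/s -> about {hours:.1f} hours")
```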

Q1) Is this really a kernel problem?

Q2) How can I best document the problem? What tools can I use? What
information do I need for a bug report (if it is a bug)? What can I do
the next time the problem appears?

Q3) Is this specific to the sparc64 port or is it a general
NetBSD problem? Where should I post the problem?
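
One possible way to document the problem for Q2 is to record over time
how many processes are stuck in each wait state, so a report can show
when the mclpl count starts to grow. The sketch below only parses text
in the column layout of the top output included further down; the
function name state_counts and the idea of feeding it saved top output
are illustrative assumptions, not an established procedure. The sample
rows are copied from that top listing.

```python
from collections import Counter

# A few rows copied from the top output in this report.
SAMPLE = """\
  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
17617 www      -22    4  6024K 6536K mclpl     71:26  0.00%  0.00% <httpd>
17267 www      -22    4  5992K 6480K mclpl     58:58  0.00%  0.00% <httpd>
    7 root     -18    0     0K   72M pgdaemon  12:19  0.00%  0.00% [pagedaemon]
  375 root       2    0  4312K 4864K select     0:30  0.00%  0.00% httpd
"""


def state_counts(top_text):
    """Count processes per STATE column (hypothetical helper).

    Assumes the 11-column layout shown above, with STATE as the
    seventh whitespace-separated field; other lines are skipped.
    """
    counts = Counter()
    for line in top_text.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[0].isdigit():
            counts[fields[6]] += 1
    return counts


counts = state_counts(SAMPLE)
print(counts["mclpl"], "process(es) in mclpl")
```

Running this periodically against saved top output and keeping the
results with timestamps would give a simple timeline to attach to a
report.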

tnx Stephan 'Pitz' Pietzko

----------------------------------------------------------------------
Some additional information:
----------------------------------------------------------------------
dmesg

console is /pci@1f,0/pci@1,1/ebus@1/su@14,3803f8
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
2005
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 2.1_STABLE (GENERIC) #0: Thu Dec  8 12:15:55 CET 2005
        root@nepal:/usr/obj/sys/arch/sparc64/compile/GENERIC
total memory = 256 MB
avail memory = 239 MB
bootpath: /pci@1f,0/pci@1,1/scsi@2,0/disk@0,0
mainbus0 (root): SUNW,UltraSPARC-IIi-cEngine: hostid 80d9aed0
cpu0 at mainbus0: SUNW,UltraSPARC-IIi @ 440.043 MHz, version 0 FPU
cpu0: 32K instruction (32 b/l), 16K data (32 b/l), 2048K external (64
b/l)
psycho0 at mainbus0 addr 0xfffc0000
SUNW,sabre: impl 0, version 0: ign 7c0 bus range 0 to 3; PCI bus 0
DVMA map: c0000000 to e0000000
IOTSB: 6f2000 to 772000
pci0 at psycho0
pci0: i/o space, memory space enabled
ppb0 at pci0 dev 1 function 1: Sun Microsystems, Inc. Simba PCI bridge
(rev. 0x13)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
ebus0 at pci1 dev 1 function 0
ebus0: Sun Microsystems, Inc. PCIO Ebus2, revision 0x01
auxio0 at ebus0 addr 726000-726003, 728000-728003, 72a000-72a003,
72c000-72c003, 72f000-72f003
power at ebus0 addr 724000-724003 ipl 37 not configured
SUNW,pll at ebus0 addr 504000-504002 not configured
com0 at ebus0 addr 3803f8-3803ff ipl 28: ns16550a, working fifo
com0: console
com1 at ebus0 addr 3602f8-3602ff ipl 20: ns16550a, working fifo
lpt0 at ebus0 addr 340278-340287, 30015c-30015d, 700000-70000f ipl 34
fdthree at ebus0 addr 3203f0-3203f7, 706000-70600f, 720000-720003 ipl
39 not configured
clock0 at ebus0 addr 0-1fff: mk48t59
flashprom at ebus0 addr 0-fffff not configured
watchdog at ebus0 addr 200000-20003f ipl 4 not configured
display7seg at ebus0 addr 200040-200040 not configured
beeper at ebus0 addr 722000-722003 not configured
flashprom at ebus0 addr 400000-5fffff not configured
flashprom at ebus0 addr 800000-9fffff not configured
i2c at ebus0 addr 600000-600003 ipl 40 not configured
i2c at ebus0 addr 100000-100003 ipl 27 not configured
SUNW,lom at ebus0 addr 400000-400063 not configured
hme0 at pci1 dev 1 function 1: Sun Happy Meal Ethernet, rev. 1
hme0: interrupting at ivec 3021
hme0: Ethernet address 08:00:20:d9:ae:d0
ukphy0 at hme0 phy 0: Generic IEEE 802.3u media interface
ukphy0: OUI 0x0006b8, model 0x000c, rev. 1
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ukphy1 at hme0 phy 1: Generic IEEE 802.3u media interface
ukphy1: OUI 0x0006b8, model 0x000c, rev. 1
ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
esiop0 at pci1 dev 2 function 0: Symbios Logic 53c875 (ultra-wide
scsi)
esiop0: using on-board RAM
esiop0: interrupting at ivec 20
scsibus0 at esiop0: 16 targets, 8 luns per target
hme1 at pci1 dev 3 function 1: Sun Happy Meal Ethernet, rev. 1
hme1: interrupting at ivec 301a
hme1: Ethernet address 08:00:20:d9:ae:d0
ukphy2 at hme1 phy 0: Generic IEEE 802.3u media interface
ukphy2: OUI 0x0006b8, model 0x000c, rev. 1
ukphy2: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ppb1 at pci0 dev 1 function 0: Sun Microsystems, Inc. Simba PCI bridge
(rev. 0x13)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled
ppb2 at pci2 dev 1 function 0: Digital Equipment DECchip 21150 PCI-PCI
Bridge (rev. 0x06)
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled
cmdide0 at pci3 dev 14 function 0
cmdide0: CMD Technology PCI0646 (rev. 0x03)
cmdide0: bus-master DMA support present
cmdide0: primary channel configured to native-PCI mode
cmdide0: using ivec 1802 for native-PCI interrupt
atabus0 at cmdide0 channel 0
cmdide0: secondary channel configured to native-PCI mode
atabus1 at cmdide0 channel 1
pcons at mainbus0 not configured
No counter-timer -- using %tick at 440MHz as system clock.
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <IBM-ESXS, MAT3073NC     FN, B411>
disk fixed
sd0: 70006 MB, 78753 cyl, 2 head, 910 sec, 512 bytes/sect x 143374000
sectors
sd0: sync (50.00ns offset 16), 16-bit (40.000MB/s) transfers, tagged
queueing
sd1 at scsibus0 target 1 lun 0: <IBM-ESXS, MAT3073NC     FN, B411>
disk fixed
sd1: 70006 MB, 78753 cyl, 2 head, 910 sec, 512 bytes/sect x 143374000
sectors
sd1: sync (50.00ns offset 16), 16-bit (40.000MB/s) transfers, tagged
queueing
cd0 at scsibus0 target 6 lun 0: <TOSHIBA, CD-ROM XM-3801TA, 1057>
cdrom removable
cd0: sync (100.00ns offset 8), 8-bit (10.000MB/s) transfers
root on sd0a dumps on sd0b
root file system type: ffs
----------------------------------------------------------------------
top 

load averages:  0.20,  0.13,  0.13    03:05:33
31 processes:  1 runnable, 29 sleeping, 1 on processor

Memory: 129M Act, 66M Inact, 1464K Wired, 6720K Exec, 161M File, 3208K Free
Swap: 4266M Total, 4266M Free


  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
17617 www      -22    4  6024K 6536K mclpl     71:26  0.00%  0.00% <httpd>
17267 www      -22    4  5992K 6480K mclpl     58:58  0.00%  0.00% <httpd>
17362 www      -22    4  5904K 6424K mclpl     19:21  0.00%  0.00% <httpd>
    7 root     -18    0     0K   72M pgdaemon  12:19  0.00%  0.00% [pagedaemon]
    9 root     -18    0     0K   72M aiodoned   4:49  0.00%  0.00% [aiodoned]
 2755 www      -22    0  5400K 5480K mclpl      3:54  0.00%  0.00% <httpd>
    8 root      18    0     0K   72M syncer     2:27  0.00%  0.00% [ioflush]
20080 www      -22    0  5904K 6296K mclpl      1:57  0.00%  0.00% <httpd>
 9939 www      -22    0  5464K 5520K mclpl      0:56  0.00%  0.00% <httpd>
  375 root       2    0  4312K 4864K select     0:30  0.00%  0.00% httpd
  219 root       2    0   288K 1048K poll       0:03  0.00%  0.00% syslogd
11833 root       2    0   448K 2584K select     0:02  0.00%  0.00% sshd
25697 root      28    0   176K 1120K CPU        0:00  0.00%  0.00% top
  456 root      10    0  2016K 1704K RUN        0:00  0.00%  0.00% tcsh
    2 root      14    0     0K   72M crypto_w   0:00  0.00%  0.00% [cryptoret]
    1 root      10    0   136K 1080K wait       0:00  0.00%  0.00% <init>
  559 root      10    0   304K 1064K nanoslee   0:00  0.00%  0.00% cron
19406 www        2    0  4312K 2400K netcon     0:00  0.00%  0.00% <httpd>
----------------------------------------------------------------------
uname -a 

NetBSD nepal 2.1_STABLE NetBSD 2.1_STABLE (GENERIC) #0: Thu Dec  8 12:15:55 CET 2005  root@nepal:/usr/obj/sys/arch/sparc64/compile/GENERIC sparc64
----------------------------------------------------------------------
some lines from lsof 

httpd     17267  www   17u    IPv4                   0t0     TCP no PCB, CANTSENDMORE, CANTRCVMORE
httpd     17267  www   19u    IPv4                   0t0     TCP no PCB, CANTSENDMORE, CANTRCVMORE
70 lines with the same message