netbsd-bugs: kern/25761: kill -HUP <pid-of-ipmon> freezes system ~weekly

Subject: kern/25761: kill -HUP freezes system ~weekly
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <arto@selonen.org>
List: netbsd-bugs
Date: 05/31/2004 12:39:03
>Number:         25761
>Category:       kern
>Synopsis:       kill -HUP <pid-of-ipmon> freezes system ~weekly
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon May 31 12:40:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Arto Selonen
>Release:        NetBSD-current from ~May 27th
>Organization:
>Environment:
NetBSD blah 2.0F NetBSD 2.0F (BLAH) #39: Thu May 27 09:44:56 EEST 2004  blah@blah:/obj/sys/arch/i386/compile/BLAH i386

>Description:
Occasionally, while rotating logs at midnight, the system freezes, and
typically only the kernel debugger can be accessed; otherwise it appears dead.
This has been tracked to sending SIGHUP to ipmon as part of the log
rotation process. The problem has persisted for well over a year now,
and its frequency seems to have increased from ~monthly to ~weekly.
We suspect this might be related to increased external traffic to this
system (due to viruses, worms, etc.), as it acts as a firewall/gateway.
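
For readers who want to picture the step involved, here is a minimal
sketch of a rotate-then-SIGHUP step in Python. It is not the actual
rotation script used on this system; the function names, the ".0" suffix,
and the example paths are assumptions made for illustration only.

```python
import os
import signal


def read_pid(pidfile):
    """Return the PID stored in pidfile as an int."""
    with open(pidfile) as f:
        return int(f.read().strip())


def rotate_and_hup(logfile, pidfile):
    """Rename logfile to logfile.0, then send SIGHUP to the PID in pidfile.

    Both paths are supplied by the caller; nothing here is specific to ipmon.
    """
    if os.path.exists(logfile):
        os.replace(logfile, logfile + ".0")
    pid = read_pid(pidfile)
    # This is the step the report associates with the freeze.
    os.kill(pid, signal.SIGHUP)
    return pid


# Hypothetical usage (paths are assumptions, adjust to the local setup):
#   rotate_and_hup("/var/log/ipflog", "/var/run/ipmon.pid")
```

The sketch only shows the order of operations (rename first, signal
second); it does not attempt to explain why the signal leads to a freeze.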

I have asked other users whether they have experienced the same (on current-users; see additional details at http://mail-index.netbsd.org/current-users/2004/03/17/0006.html),
but got no responses.

Traces from kernel debugger have always looked the same (prior to
upgrading from 2.0E to 2.0F):

typical sample, from May 1st:
Stopped in pid 368.1 (squid) at netbsd:cpu_Debugger+0x4:        leave
db> tr
cpu_Debugger(c16b8000,c16c100a,ce3b7c00,c02c607c,c16b8000) at netbsd:cpu_Debugger+0x4
internal_command(c16b8000,ce3b7c0c,f420,1b,0) at netbsd:internal_command+0x13c
wskbd_translate(c03a4b00,2,1,1,c2a06000) at netbsd:wskbd_translate+0x6c
wskbd_input(c16b8000,2,1,1,c03734c0) at netbsd:wskbd_input+0x137
pckbd_input(c16bec80,1,0,c03b7264,1) at netbsd:pckbd_input+0x53
pckbportintr(c03a4c60,0,1,1,1) at netbsd:pckbportintr+0x3a
pckbcintr(c16bed00,0,c0270010,30,10) at netbsd:pckbcintr+0x94
Xintr_legacy1() at netbsd:Xintr_legacy1+0xa4
--- interrupt ---
Xspllower(0,ffffffff,400000,0,c035dad0) at netbsd:Xspllower+0xe
malloc(56000,c035ade0,1,0,40100) at netbsd:malloc+0x15d
amap_alloc1(1555d,0,1,81eb000,cd91b898) at netbsd:amap_alloc1+0x7e
amap_copy(cd91b898,d10fe370,1,1,81eb000) at netbsd:amap_copy+0xd3
uvmfault_amapcopy(ce3b7ed4,6,0,1,c03b90a0) at netbsd:uvmfault_amapcopy+0xa4
uvm_fault(cd91b898,81eb000,0,2,0) at netbsd:uvm_fault+0x10d
trap() at netbsd:trap+0x38d
--- trap (number 6) ---
0x80650eb:
db> reboot

Recently, I noticed some UVM-related activity and changes taking place,
so I've tried to upgrade more frequently to see if they would also address
this issue; they have not. The latest upgrade, from ~May 27th sources, did
change the trace, though. Here is the trace from May 30th:

fxp0: device timeout
fxp0: device timeout
Stopped in pid 10.1 (pagedaemon) at     netbsd:cpu_Debugger+0x4:        leave
db> tr
cpu_Debugger(c16be000,c16c800a,cd94fdac,c02c87d4,c16be000) at netbsd:cpu_Debugger+0x4
internal_command(c16be000,cd94fdb8,f420,1b,1) at netbsd:internal_command+0x13c
wskbd_translate(c03a8900,2,1,1,c01eb4c8) at netbsd:wskbd_translate+0x6c
wskbd_input(c16be000,2,1,1,1) at netbsd:wskbd_input+0x137
pckbd_input(c16c5d00,1,cd94e948,c14a8b60,578) at netbsd:pckbd_input+0x53
pckbportintr(c03a8a40,0,1,423f0,1) at netbsd:pckbportintr+0x3a
pckbcintr(c16c5d80,0,10,cd940030,894b0010) at netbsd:pckbcintr+0x94
Xintr_legacy1() at netbsd:Xintr_legacy1+0xa4
--- interrupt ---
Xspllower(0,cd91ce58,3f2,0,c020ee25) at netbsd:Xspllower+0xe
mpidle(cd9254a4,0,8,0,1) at netbsd:mpidle+0xd1
ltsleep(c03bb20c,204,c032297e,0,c03bb214) at netbsd:ltsleep+0x323
uvm_pageout(cd9254a4,410000,419000,0,c010030c) at netbsd:uvm_pageout+0x46
db> re
syncing disks... uvm_fault(0xc03b4d40, 0, 0, 1) -> 0xe
kernel: page fault trap, code=0
Stopped in pid 10.1 (pagedaemon) at     netbsd:bio_doread+0x3c: movl    0xc(%eax),%eax
db> reboot

The problem still appeared at midnight, so I'm assuming it still gets
triggered the same way.
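
In both traces the frames above the "--- interrupt ---" marker are the
console keyboard path into the debugger, so the frames below it show what
was running when the debugger was entered. A small sketch for pulling those
frames out of a saved trace, so that traces from different dates can be
compared, might look like this (the function name and the regular
expression are illustrative assumptions, not part of any NetBSD tool):

```python
import re

# Matches frames such as "malloc(56000,...) at netbsd:malloc+0x15d".
FRAME_RE = re.compile(r"^(\w+)\(.*\) at netbsd:(\w+)\+0x[0-9a-f]+$")


def interrupted_frames(trace_text):
    """Return the function names listed after the '--- interrupt ---' marker."""
    names = []
    seen_marker = False
    for line in trace_text.splitlines():
        line = line.strip()
        if line.startswith("--- interrupt"):
            seen_marker = True
            continue
        if not seen_marker:
            continue  # still in the keyboard/debugger entry path
        m = FRAME_RE.match(line)
        if m:
            names.append(m.group(2))
    return names


# Tail of the May 30th trace, copied from above.
MAY30 = """\
Xintr_legacy1() at netbsd:Xintr_legacy1+0xa4
--- interrupt ---
Xspllower(0,cd91ce58,3f2,0,c020ee25) at netbsd:Xspllower+0xe
mpidle(cd9254a4,0,8,0,1) at netbsd:mpidle+0xd1
ltsleep(c03bb20c,204,c032297e,0,c03bb214) at netbsd:ltsleep+0x323
uvm_pageout(cd9254a4,410000,419000,0,c010030c) at netbsd:uvm_pageout+0x46
"""
print(interrupted_frames(MAY30))
# -> ['Xspllower', 'mpidle', 'ltsleep', 'uvm_pageout']
```

Run against the May 1st trace, the same function would list the
malloc/amap_copy/uvm_fault chain instead, which makes the change between
the two kernels easy to see at a glance.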

It would seem that UDP traffic is affected first (most/only?), and that this leads to DNS failures. Also, previously ipfilter did not log anything between the freeze and the reboot. Now it seems to be logging normally (but it is hard to tell without known test cases).

My *guess* is that this is related to either ipfilter (both 3.x and 4.x) or UVM, or both (interaction).

We have tried the following without success:

  - upgrade OS (whenever there were changes to ipfilter, UVM, etc.)
  - upgrade squid (it runs there as a transparent proxy/cache)
  - replace memory
  - rotate squid logs differently (before the ipmon connection was determined)

Here is the current dmesg output:
NetBSD 2.0F (BLAH) #39: Thu May 27 09:44:56 EEST 2004
        blah@blah:/obj/sys/arch/i386/compile/BLAH
total memory = 1023 MB
avail memory = 998 MB
BIOS32 rev. 0 found at 0xfda74
mainbus0 (root)
cpu0 at mainbus0: (uniprocessor)
cpu0: Intel Pentium 4 (686-class), 1794.26 MHz, id 0xf24
cpu0: features 3febfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 3febfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features 3febfbff<FXSR,SSE,SSE2,SS,HTT,TM>
cpu0: "Intel(R) Pentium(R) 4 CPU 1.80GHz"
cpu0: I-cache 12K uOp cache 8-way, D-cache 8 KB 64b/line 4-way
cpu0: L2 cache 512 KB 64b/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: using thermal monitor 1
cpu0: 16 page colors
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82845 Host (rev. 0x04)
pchb0: random number generator enabled
agp0 at pchb0: aperture at 0xf8000000, size 0x4000000
ppb0 at pci0 dev 1 function 0: Intel 82845 AGP (rev. 0x04)
pci1 at ppb0 bus 1
pci1: memory space enabled
ppb1 at pci0 dev 30 function 0: Intel 82801BA Hub-to-PCI Bridge (rev. 0x05)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled
vga1 at pci2 dev 9 function 0: Matrox MGA Millennium II 2164W (rev. 0x00)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
fxp0 at pci2 dev 10 function 0: i82550 Ethernet, rev 12
fxp0: interrupting at irq 3
fxp0: Ethernet address 00:02:b3:60:b1:d7
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1 at pci2 dev 12 function 0: i82550 Ethernet, rev 12
fxp1: interrupting at irq 10
fxp1: Ethernet address 00:02:b3:60:b6:5d
inphy1 at fxp1 phy 1: i82555 10/100 media interface, rev. 4
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pcib0 at pci0 dev 31 function 0
pcib0: Intel 82801BA LPC Interface Bridge (rev. 0x05)
piixide0 at pci0 dev 31 function 1
piixide0: Intel 82801BA IDE Controller (ICH2) (rev. 0x05)
piixide0: bus-master DMA support present
piixide0: primary channel wired to compatibility mode
piixide0: primary channel interrupting at irq 14
atabus0 at piixide0 channel 0
piixide0: secondary channel wired to compatibility mode
piixide0: secondary channel interrupting at irq 15
atabus1 at piixide0 channel 1
uhci0 at pci0 dev 31 function 2: Intel 82801BA USB Controller (rev. 0x05)
uhci0: interrupting at irq 12
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
Intel 82801BA SMBus Controller (SMBus serial bus, revision 0x05) at pci0 dev 31 function 3 not configured
uhci1 at pci0 dev 31 function 4: Intel 82801BA USB Controller (rev. 0x05)
uhci1: interrupting at irq 9
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
IPsec: Initialized Security Association Processing.
wd0 at atabus0 drive 0: <MAXTOR 6L080L4>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 76345 MB, 155114 cyl, 16 head, 63 sec, 512 bytes/sect x 156355584 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1 at atabus0 drive 1: <MAXTOR 6L080L4>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 76345 MB, 155114 cyl, 16 head, 63 sec, 512 bytes/sect x 156355584 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(piixide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd1(piixide0:0:1): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
atapibus0 at atabus1: 2 targets
cd0 at atapibus0 drive 0: <CD-S520/A, , 1.7X> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
cd0(piixide0:1:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
uhub2 at uhub1 port 2
uhub2: Intel product 0x1122, class 9/0, rev 1.10/0.00, addr 2
uhub2: 4 ports with 4 removable, self powered
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
fxp0: Microcode loaded: int delay: 1000 usec, max bundle: 6
fxp1: Microcode loaded: int delay: 1000 usec, max bundle: 6
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)

Is there anything else I could provide to help find the cause of these
annoying freezes?
>How-To-Repeat:
I don't have a repeatable scenario. The system acts as a firewall/gateway, with ipfilter and no routing protocols. It also runs squid as a transparent proxy/cache, with ipnat used to direct www traffic to squid. There are
also IPsec tunnels from the system to external systems, and other NAT rules for masking certain internally used IANA addresses.
>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: