Subject: Random lockups and swap lockup
To: None <port-sparc@netbsd.org>
From: Rui Paulo <rpaulo@netbsd-pt.org>
List: port-sparc
Date: 03/17/2005 21:22:18
Greetings.
I've been stress-testing NetBSD/sparc by malloc()'ing large quantities of
memory, to exercise swap and RAM and check the overall stability of
this system.
Strangely, what I got was some lockups. I believe I found two different
problems while doing this:
1) When swap and memory both approach 100% usage, the system locks up
(I can't kill the process from ddb) and I see a lot of messages like
this in the kernel output:
...
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
cpu0: bogus interrupt ipl 0xa pc=0xf010cfa8 npc=0xf010cf94 psr=404006c6<S,PS>
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
...
(Also, UVM tries to kill all the processes that are eating swap and memory,
but somehow it isn't able to, and they all continue to call malloc().)
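As a reading aid for the psr value in the "bogus interrupt" line above, here
is a minimal sketch that splits a 32-bit PSR into its fields, assuming the
standard SPARC V8 PSR layout. The names psr_fields, decode_psr and format_psr
are made up for this illustration and are not taken from the NetBSD sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of the SPARC V8 Processor State Register (PSR). */
struct psr_fields {
	unsigned impl;	/* bits 31:28, implementation */
	unsigned ver;	/* bits 27:24, version */
	unsigned icc;	/* bits 23:20, condition codes N,Z,V,C */
	unsigned ec;	/* bit 13, coprocessor enabled */
	unsigned ef;	/* bit 12, FPU enabled */
	unsigned pil;	/* bits 11:8, processor interrupt level */
	unsigned s;	/* bit 7, supervisor mode */
	unsigned ps;	/* bit 6, previous supervisor mode */
	unsigned et;	/* bit 5, traps enabled */
	unsigned cwp;	/* bits 4:0, current window pointer */
};

/* Hypothetical helper: extract each field with a shift and a mask. */
struct psr_fields
decode_psr(uint32_t psr)
{
	struct psr_fields f;

	f.impl = (psr >> 28) & 0xf;
	f.ver = (psr >> 24) & 0xf;
	f.icc = (psr >> 20) & 0xf;
	f.ec = (psr >> 13) & 0x1;
	f.ef = (psr >> 12) & 0x1;
	f.pil = (psr >> 8) & 0xf;
	f.s = (psr >> 7) & 0x1;
	f.ps = (psr >> 6) & 0x1;
	f.et = (psr >> 5) & 0x1;
	f.cwp = psr & 0x1f;
	return f;
}

/* Write a one-line summary into buf; returns the snprintf() result. */
int
format_psr(uint32_t psr, char *buf, size_t len)
{
	struct psr_fields f = decode_psr(psr);

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "pil=%u s=%u ps=%u et=%u cwp=%u",
	    f.pil, f.s, f.ps, f.et, f.cwp);
}
```

For the value in the log, decode_psr(0x404006c6) gives pil=6, s=1, ps=1, et=0
and cwp=6, which agrees with the <S,PS> annotation printed by the kernel.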
I immediately entered ddb to check what could be the origin of the problem:
db{0}> show all procs
   PID   PPID   PGRP    UID  S    FLAGS  LWPS  COMMAND     WAIT
   693    274    693   1000  2   0x4002     1  top         poll
   274    689    274   1000  2   0x4002     1  ksh         pause
   689    651    651   1000  2    0x100     1  sshd        select
   240    622    240   1000  2   0x4002     1  eat
>  335    680    335   1000  2   0x4002     1  eat
   654    100    654   1000  2   0x6002     1  eat
   589     96    589   1000  2   0x4002     1  eat
   651    375    651      0  2    0x101     1  sshd        netio
   680    263    680   1000  2   0x4002     1  ksh         pause
   263    596    596   1000  2    0x100     1  sshd        select
   100     99    100   1000  2   0x4002     1  ksh         pause
    99    584    584   1000  2    0x100     1  sshd        select
    96     70     96   1000  2   0x4002     1  ksh         pause
    70    585    585   1000  2    0x100     1  sshd        select
   622    600    622   1000  2   0x4002     1  ksh         pause
   600    597    597   1000  2    0x100     1  sshd        select
   596    375    596      0  2    0x101     1  sshd        netio
   584    375    584      0  2    0x101     1  sshd        netio
   585    375    585      0  2    0x101     1  sshd        netio
   597    375    597      0  2    0x101     1  sshd        netio
   557      1    557      0  2   0x4002     1  login       ttyin
   569      1    569      0  2        0     1  cron        nanosle
   412    543    543     12  2   0x4100     1  qmgr        select
   428    543    543     12  2   0x4100     1  pickup      select
   543      1    543      0  2   0x4108     1  master      select
   375      1    375      0  2        0     1  sshd        select
   347      1    347      0  2        0     1  rtadvd      poll
   350    330    330      0  2        0     1  ntpd        pause
   330      1    330     15  2    0x100     1  ntpd        pause
   303      1    303      0  2        0     1  dhcpd       select
   242      1    242      0  2        0     1  mount_mfs   mfsidl
   231      1    231     14  2    0x500     3  named       *
   144      1    144      0  2        0     1  ipmon       nanosle
   218      1    218      0  2        0     1  altqd       select
   189      1    189      0  2        0     1  syslogd
     5      0      0      0  2  0x20200     1  aiodoned    aiodone
     4      0      0      0  2  0x20200     1  ioflush     syncer
     3      0      0      0  2  0x20200     1  pagedaemon
     2      0      0      0  2  0x20200     1  scsibus0    sccomp
     1      0      1      0  2   0x4000     1  init
     0     -1      0      0  2  0x20200     1  swapper
eat is a simple C program that does exactly this (yes, I'm running 4
instances of it):
#include <stdlib.h>

int main(void) {
	/* Allocate 4 bytes at a time, never freeing, until malloc() fails. */
	for (; malloc(4); )
		continue;
	return 0;
}
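For anyone who wants to reproduce this more gradually, here is a minimal
sketch of a bounded variant of the same idea. It is not the program used for
the results in this message: the name eat_memory and the chunk and limit
parameters are hypothetical, and unlike the loop above it writes to every
allocated byte and stops at a caller-chosen limit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate memory in chunk-byte pieces until malloc() fails or until
 * another chunk would exceed limit bytes. Every byte is written so the
 * pages must be backed by RAM or swap. The memory is deliberately never
 * freed. Returns the number of bytes allocated.
 */
size_t
eat_memory(size_t chunk, size_t limit)
{
	size_t total = 0;

	assert(chunk > 0);
	/* total never exceeds limit, so limit - total cannot wrap around. */
	while (limit - total >= chunk) {
		void *p = malloc(chunk);

		if (p == NULL)
			break;
		memset(p, 0xa5, chunk);	/* touch the whole chunk */
		total += chunk;
	}
	return total;
}
```

A call such as eat_memory(4096, 64 * 1024 * 1024) stops after 64 MB, which
may make it easier to approach the near-100% point in steps instead of
running straight into it.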
db{0}> machine proc
LWP 0xf2e405e0: PID:335.1 CPU:0 stat:7 vmspace:0xf2ddcda0 ctx: 0xf02d68f0 cpuset 1
pmap:0xf2e80100 wchan:0x0 pri:86 upri:86
maxsaddr:0xe0000000 ssiz:1 pg or 1000B
profile timer: 0 sec 0 usec
pcb: 0xf2e7c000
db{0}> machine cpu
CPU#  CPUINFO     FLAGS  CURLWP      CURPROC     FPLWP
   0  0xf2198000  9000   0xf2e405e0  0xf2e5f978  0xf29c0a18
   1  0xf2199000  b000   0xf2e40558  0xf2e5f7e8  0x0
The system _doesn't_ panic(), but locks up. Here is the traceback:
db{0}> bt
cpu_Debugger(0xf059ee70, 0x0, 0x17, 0x9790, 0x480, 0x1) at netbsd:zsc_intr_hard+0x158
zsc_intr_hard(0x8, 0x7ffffc00, 0x130a91b5, 0x38c2, 0xffff, 0xa00) at netbsd:zshard+0x40
zshard(0x0, 0xf01c7928, 0xd00, 0x408000e6, 0xf0262800, 0xf028f800) at netbsd:sparc_interrupt44c+0x148
sparc_interrupt44c(0x1, 0xa00, 0xf02608f8, 0x7cf3, 0x798, 0xf0261090) at netbsd:hardclock+0x380
hardclock(0xf028fd50, 0xf023b30c, 0x400040, 0x0, 0x600, 0x100) at netbsd:lockmgr+0x4e8
lockmgr(0xf028fd50, 0x400042, 0x400040, 0x408000e1, 0x100, 0xf02628f8) at netbsd:sparc_interrupt44c+0x110
sparc_interrupt44c(0xf0213400, 0x6, 0x400040, 0x7cb7, 0x5b8, 0xf0262800) at netbsd:hardclock+0x370
hardclock(0xf028fd50, 0xf023b494, 0x400040, 0x0, 0x600, 0x10117a24) at netbsd:lockmgr+0x4e8
lockmgr(0xf028fd50, 0x400042, 0x400040, 0x408000e4, 0xffff, 0x10147740) at 0xf000665c
0xf000665c(0x4, 0xf01c180c, 0xe00, 0x408000a5, 0xffff, 0x36c102c) at netbsd:acquire+0x4c
acquire(0xf028fd50, 0xf2e7de4c, 0x400000, 0x0, 0x600, 0xf0248800) at netbsd:lockmgr+0x4e8
lockmgr(0xf028fd50, 0x400002, 0x400000, 0xf2e7dfb0, 0x400, 0xf0262800) at netbsd:_kernel_proc_lock+0x14
_kernel_proc_lock(0xf2e405e0, 0xf2e7df28, 0xfffffffc, 0x36c3000, 0x20, 0x100fda78) at netbsd:syscall+0x284
syscall(0x11, 0xf2e7dfb0, 0x1011915c, 0x1003d6a0, 0x0, 0xf2e7df28) at 0xf0006500
2) Sometimes the system locks up without printing anything to the console,
so my guess would be a deadlock of some kind in the mem/swap handling. Or am I
completely wrong?
This is a SPARCstation 20 MP (2 * LM50 - 50 MHz, no external cache) with
128 MB of RAM. The whole system was compiled with CPUFLAGS=-mcpu=supersparc.
Thanks in advance.
--
Rui Paulo <rpaulo@netbsd-pt.org> http://www.netbsd-pt.org/users/rpaulo/