Subject: Random lockups and swap lockup
To: None <port-sparc@netbsd.org>
From: Rui Paulo <rpaulo@netbsd-pt.org>
List: port-sparc
Date: 03/17/2005 21:22:18
Greetings.
I've been testing NetBSD/sparc by malloc()'ing large quantities of memory,
to stress swap and memory and check the overall stability of the system.
Strangely, what I got was lockups. I think I found 2 different problems
while doing this:
1) When swap and memory both get close to 100% usage, the system locks
up (I can't kill the process from ddb) and I see a lot of messages like
this in the kernel output:
...
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
cpu0: bogus interrupt ipl 0xa pc=0xf010cfa8 npc=0xf010cf94 psr=404006c6<S,PS>  
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
warning: resource shortage: 1 pages of swap lost
...

(Also, UVM tries to kill all the processes that are eating swap and memory,
but somehow it isn't able to, and they all continue to call malloc().)

Immediately I entered ddb to look for the origin of the problem:

db{0}> show all procs
 PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
 693            274      693       1000 2  0x4002    1              top    poll
 274            689      274       1000 2  0x4002    1              ksh   pause
 689            651      651       1000 2   0x100    1             sshd  select
 240            622      240       1000 2  0x4002    1              eat
>335            680      335       1000 2  0x4002    1              eat
 654            100      654       1000 2  0x6002    1              eat
 589             96      589       1000 2  0x4002    1              eat
 651            375      651          0 2   0x101    1             sshd   netio
 680            263      680       1000 2  0x4002    1              ksh   pause
 263            596      596       1000 2   0x100    1             sshd  select
 100             99      100       1000 2  0x4002    1              ksh   pause
 99             584      584       1000 2   0x100    1             sshd  select
 96              70       96       1000 2  0x4002    1              ksh   pause
 70             585      585       1000 2   0x100    1             sshd  select
 622            600      622       1000 2  0x4002    1              ksh   pause
 600            597      597       1000 2   0x100    1             sshd  select
 596            375      596          0 2   0x101    1             sshd   netio
 584            375      584          0 2   0x101    1             sshd   netio
 585            375      585          0 2   0x101    1             sshd   netio
 597            375      597          0 2   0x101    1             sshd   netio
 557              1      557          0 2  0x4002    1            login   ttyin
 569              1      569          0 2       0    1             cron nanosle
 412            543      543         12 2  0x4100    1             qmgr  select
 428            543      543         12 2  0x4100    1           pickup  select
 543              1      543          0 2  0x4108    1           master  select
 375              1      375          0 2       0    1             sshd  select
 347              1      347          0 2       0    1           rtadvd    poll
 350            330      330          0 2       0    1             ntpd   pause
 330              1      330         15 2   0x100    1             ntpd   pause
 303              1      303          0 2       0    1            dhcpd  select
 242              1      242          0 2       0    1        mount_mfs  mfsidl
 231              1      231         14 2   0x500    3            named       *
 144              1      144          0 2       0    1            ipmon nanosle
 218              1      218          0 2       0    1            altqd  select
 189              1      189          0 2       0    1          syslogd
 5                0        0          0 2 0x20200    1         aiodoned aiodone
 4                0        0          0 2 0x20200    1          ioflush  syncer
 3                0        0          0 2 0x20200    1       pagedaemon
 2                0        0          0 2 0x20200    1         scsibus0  sccomp
 1                0        1          0 2  0x4000    1             init
 0               -1        0          0 2 0x20200    1          swapper

eat is a simple C program that does exactly this (yes, I'm running 4 
instances of it):
	#include <stdlib.h>

	/* Allocate 4 bytes at a time, never freeing, until malloc() fails. */
	int main(void) {
		for (; malloc(4); )
			continue;
		return 0;
	}
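
For anyone who wants to reproduce this with a bit more control, here is a
minimal sketch of a bounded variant of eat. It is not the program I ran:
the names eat_upto, eat_release and struct chunk are hypothetical, and each
allocation is sizeof(struct chunk) bytes instead of 4 so that the chunks can
be linked together and freed afterwards.

```c
#include <stddef.h>
#include <stdlib.h>

/* One allocation; the link lets us free everything afterwards. */
struct chunk {
	struct chunk *next;
};

/*
 * Hypothetical bounded variant of eat: allocate up to max_allocs chunks
 * and return how many malloc() calls succeeded.  Each chunk is pushed
 * onto the list at *head.
 */
unsigned long
eat_upto(struct chunk **head, unsigned long max_allocs)
{
	unsigned long n;

	for (n = 0; n < max_allocs; n++) {
		struct chunk *c = malloc(sizeof(*c));

		if (c == NULL)
			break;		/* malloc() failed: stop eating */
		c->next = *head;	/* writing the link also touches the memory */
		*head = c;
	}
	return n;
}

/* Free every chunk on the list and return how many were freed. */
unsigned long
eat_release(struct chunk **head)
{
	unsigned long n = 0;

	while (*head != NULL) {
		struct chunk *c = *head;

		*head = c->next;
		free(c);
		n++;
	}
	return n;
}
```

Calling eat_upto() with a very large max_allocs from main() should behave
much like eat, while a small cap makes it possible to step up the pressure
gradually and release the memory with eat_release() between runs.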


db{0}> machine proc      
LWP 0xf2e405e0: PID:335.1 CPU:0 stat:7 vmspace:0xf2ddcda0 ctx: 0xf02d68f0 cpuset 1
pmap:0xf2e80100 wchan:0x0 pri:86 upri:86
maxsaddr:0xe0000000 ssiz:1 pg or 1000B
profile timer: 0 sec 0 usec
pcb: 0xf2e7c000   

db{0}> machine cpu
CPU# CPUINFO    FLAGS    CURLWP     CURPROC    FPLWP
0    0xf2198000 9000     0xf2e405e0 0xf2e5f978 0xf29c0a18
1    0xf2199000 b000     0xf2e40558 0xf2e5f7e8 0x0


The system _doesn't_ panic(), but locks up. Here is the traceback:

db{0}> bt
cpu_Debugger(0xf059ee70, 0x0, 0x17, 0x9790, 0x480, 0x1) at netbsd:zsc_intr_hard+0x158
zsc_intr_hard(0x8, 0x7ffffc00, 0x130a91b5, 0x38c2, 0xffff, 0xa00) at netbsd:zshard+0x40
zshard(0x0, 0xf01c7928, 0xd00, 0x408000e6, 0xf0262800, 0xf028f800) at netbsd:sparc_interrupt44c+0x148
sparc_interrupt44c(0x1, 0xa00, 0xf02608f8, 0x7cf3, 0x798, 0xf0261090) at netbsd:hardclock+0x380
hardclock(0xf028fd50, 0xf023b30c, 0x400040, 0x0, 0x600, 0x100) at netbsd:lockmgr+0x4e8
lockmgr(0xf028fd50, 0x400042, 0x400040, 0x408000e1, 0x100, 0xf02628f8) at netbsd:sparc_interrupt44c+0x110
sparc_interrupt44c(0xf0213400, 0x6, 0x400040, 0x7cb7, 0x5b8, 0xf0262800) at netbsd:hardclock+0x370
hardclock(0xf028fd50, 0xf023b494, 0x400040, 0x0, 0x600, 0x10117a24) at netbsd:lockmgr+0x4e8
lockmgr(0xf028fd50, 0x400042, 0x400040, 0x408000e4, 0xffff, 0x10147740) at 0xf000665c
0xf000665c(0x4, 0xf01c180c, 0xe00, 0x408000a5, 0xffff, 0x36c102c) at netbsd:acquire+0x4c
acquire(0xf028fd50, 0xf2e7de4c, 0x400000, 0x0, 0x600, 0xf0248800) at netbsd:lockmgr+0x4e8
lockmgr(0xf028fd50, 0x400002, 0x400000, 0xf2e7dfb0, 0x400, 0xf0262800) at netbsd:_kernel_proc_lock+0x14
_kernel_proc_lock(0xf2e405e0, 0xf2e7df28, 0xfffffffc, 0x36c3000, 0x20, 0x100fda78) at netbsd:syscall+0x284
syscall(0x11, 0xf2e7dfb0, 0x1011915c, 0x1003d6a0, 0x0, 0xf2e7df28) at 0xf0006500
 

2) Sometimes the system locks up without printing anything at all to the
console, so my guess would be a deadlock of some kind in the mem/swap
handling. Or am I completely wrong?

This is a SPARCstation 20 MP (2 * LM50 - 50 MHz, no external cache) with
128 MB of RAM. The whole system was compiled with CPUFLAGS=-mcpu=supersparc.

Thanks in advance.

-- 
 Rui Paulo <rpaulo@netbsd-pt.org>        http://www.netbsd-pt.org/users/rpaulo/