port-hp300: ps sometimes goes into an infinite loop

Subject: ps sometimes goes into an infinite loop
To: None <port-hp300@NetBSD.ORG>
From: Duncan McEwan <duncan@Comp.VUW.AC.NZ>
List: port-hp300
Date: 04/03/1996 12:42:00

Sorry about the length of this: hopefully most of what I've included will be
useful in tracking down this problem...

I'm not sure if this is hp300 specific, or if it has been fixed in
netbsd-current, but this is a problem I see on an hp425 running NetBSD 1.1.  If
it is not fixed in current, and there is no easy fix available, I'll send-pr
it.

We have a cron job that runs every 15 minutes to check on the status of a
particular process.  It does this with a 'ps -uaxww | grep ...' command.  Every
so often (maybe about once a day) I find a looping ps process, i.e.:

  PID USERNAME PRI NICE   SIZE   RES STATE   TIME   WCPU    CPU COMMAND
...
25699 root      93    0   508K   24K run   852:49 47.07% 47.07% ps
...

I used ktrace to determine that it was continuously reading 56 bytes from
location 0 of /dev/kmem.

 25699 ps       RET   read 56/0x38
 25699 ps       CALL  lseek(0x4,0,0,0,0)
 25699 ps       RET   lseek 0
 25699 ps       CALL  read(0x4,0xffeff870,0x38)
 25699 ps       GIO   fd 4 read 56 bytes
       "N\M-x\^D\0\0\0\0\0\0\0\^D*\0\0\^D\M^X\0\0\^F(\0\0\^F2\0\0\^F<\0\0...
 25699 ps       RET   read 56/0x38
 25699 ps       CALL  lseek(0x4,0,0,0,0)
 25699 ps       RET   lseek 0
 25699 ps       CALL  read(0x4,0xffeff870,0x38)
 25699 ps       GIO   fd 4 read 56 bytes
       "N\M-x\^D\0\0\0\0\0\0\0\^D*\0\0\^D\M^X\0\0\^F(\0\0\^F2\0\0\^F<\0\0...
 25699 ps       RET   read 56/0x38
 ...

(lsof told me that fd 4 was /dev/kmem).

Next, I compiled and installed a version of ps with debugging symbols in both
the ps and the libkvm .o files.

Now, when I catch a rogue ps, I can use gdb to see where it is looping.  Here
is what I find:

gdb /bin/ps 25699
(gdb) where
#0  kvm_read (kd=0x2d000, kva=0, buf=0xffeff854, len=56) at kvm.c:478
#1  0x5940 in _kvm_uread (kd=0x2d000, p=0x0, va=4293918704, cnt=0xffeff8c0)
    at kvm_proc.c:121
#2  0x6866 in kvm_uread (kd=0x2d000, p=0x42640, uva=4293918704, buf=0xffeff8fc "", 
    len=16) at kvm_proc.c:819
#3  0x6782 in kvm_doargv (kd=0x2d000, kp=0x42640, nchr=0, info=0x66ba <ps_str_a>)
    at kvm_proc.c:761
#4  0x6800 in kvm_getargv (kd=0x2d000, kp=0x42640, nchr=0) at kvm_proc.c:788
#5  0x1d3c in command (ki=0x5a32e, ve=0x2c0b0) at print.c:143
#6  0x3850 in main (argc=0, argv=0xffefff54) at ps.c:335
(gdb) print head
$4 = 34661924

Single stepping from here shows that it is looping in _kvm_uread, in the loop
starting at line 120 of kvm_proc.c.

        head = (u_long)&p->p_vmspace->vm_map.header;
        addr = head;
        while (1) {
                if (KREAD(kd, addr, &vme))
                        return (0);

                if (va >= vme.start && va < vme.end && 
                    vme.object.vm_object != 0)
                        break;

                addr = (u_long)vme.next;
                if (addr == head)
                        return (0);
        }

(gdb) print va
$1 = 4293918704
(gdb) print vme
$2 = {prev = 0x4ef80400, next = 0x0, start = 1066, end = 1176, object = {
    vm_object = 0x628, share_map = 0x628, sub_map = 0x628}, offset = 1586, 
  is_a_map = 1596, is_sub_map = 1606, copy_on_write = 1616, needs_copy = 1820, 
  protection = 1576, max_protection = 1432, inheritance = 1666, wired_count = 1524}
(gdb) print addr
$3 = 0
(gdb)

You can see from the above that 'va' does not lie between vme.start and
vme.end, so we will never break out of the loop.  Furthermore, vme.next is 0x0,
so 'addr' will remain 0 (it can never equal 'head', which is 34661924), and the
KREAD itself keeps succeeding (each read returns 56 bytes), so we will continue
trying to KREAD from addr 0 into &vme.  The value for 'va' looks OK.  It is
calculated in kvm_doargv() at line 762 of kvm_proc.c as
'USRSTACK - sizeof(arginfo)'.
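
As a cross-check, the bytes in the ktrace GIO line can be decoded by hand and
compared with what gdb prints for 'vme'.  The sketch below does that in C.  It
assumes each field is a 32-bit big-endian word laid out in the order gdb prints
them (prev, next, start, end, object, offset, ...); the names kmem0, be32_at
and va_in_entry are made up for this illustration and are not part of libkvm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * The bytes visible in the GIO line, with the escapes expanded:
 *   \^D  = 0x04, \^F = 0x06               (control characters)
 *   \M-x = 'x' | 0x80 = 0xf8              (high bit set)
 *   \M^X = control-X | 0x80 = 0x98        (high bit set)
 * The trace is truncated after these words, so only seven are listed.
 */
static const uint8_t kmem0[] = {
    0x4e, 0xf8, 0x04, 0x00,     /* "N\M-x\^D\0"   -> prev      */
    0x00, 0x00, 0x00, 0x00,     /* "\0\0\0\0"     -> next      */
    0x00, 0x00, 0x04, 0x2a,     /* "\0\0\^D*"     -> start     */
    0x00, 0x00, 0x04, 0x98,     /* "\0\0\^D\M^X"  -> end       */
    0x00, 0x00, 0x06, 0x28,     /* "\0\0\^F("     -> object    */
    0x00, 0x00, 0x06, 0x32,     /* "\0\0\^F2"     -> offset    */
    0x00, 0x00, 0x06, 0x3c,     /* "\0\0\^F<"     -> is_a_map  */
};

/* Return the word_index'th 32-bit big-endian word of the dump. */
static uint32_t be32_at(size_t word_index)
{
    const uint8_t *p;

    assert((word_index + 1) * 4 <= sizeof kmem0);
    p = &kmem0[word_index * 4];
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* The same test the loop applies before it breaks out. */
static int va_in_entry(uint32_t va)
{
    uint32_t start = be32_at(2);
    uint32_t end = be32_at(3);
    uint32_t object = be32_at(4);

    return va >= start && va < end && object != 0;
}
```

Under those assumptions the decoded words agree with the gdb output (start =
1066, end = 1176, object = 0x628, next = 0), and va_in_entry(4293918704u) is
false, so the 'vme' being tested really is just whatever happens to live at
offset 0 of /dev/kmem.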

So how does addr become 0 in the first place?

One possibility is that the 'p' parameter to _kvm_uread() has the seemingly
bogus value of 0x0 (from the stack trace above).  If this is bogus, the value
of 'head' calculated at line 118 is likely to be bogus as well.

But I can't figure out how the 'p' parameter gets the value 0x0.  Notice how
the stack backtrace also shows the 'p' parameter to kvm_uread() (no '_') is
0x42640.  On examining the code around line 804 of kvm_proc.c, I can't see how
the 'p' passed to _kvm_uread() could become 0x0 (stack corruption doesn't seem
likely since the parameters on either side of it are unchanged).  Am I missing
something here, or is gdb lying to me?

If it turns out that this is simply a consequence of the fact that the system is
changing and so 'ps' gets an inconsistent view of the system state, then
perhaps ps (or libkvm) needs some additional paranoia to prevent this loop.
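
To make "additional paranoia" concrete, here is a rough, self-contained sketch
of the same list walk with two extra exits: one for a null 'next' link and one
capping the number of entries visited.  It is not a patch against libkvm.  The
structure, the simulated kernel memory, the MAX_WALK limit and all of the
fake_* names are hypothetical stand-ins so that the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the kernel's vm_map_entry. */
struct fake_entry {
    unsigned long next;         /* "kernel address" of the next entry */
    unsigned long start, end;   /* covers the half-open range [start, end) */
    unsigned long object;       /* non-zero when an object is attached */
};

#define FAKE_KMEM_ENTRIES 8
#define MAX_WALK 1024           /* arbitrary cap on entries visited */

/* Simulated kernel memory; an "address" is just an index into this array. */
static struct fake_entry fake_kmem[FAKE_KMEM_ENTRIES];

/* Helper for setting up test lists in the simulated memory. */
static void fake_set(unsigned long addr, unsigned long next,
                     unsigned long start, unsigned long end,
                     unsigned long object)
{
    assert(addr < FAKE_KMEM_ENTRIES);
    fake_kmem[addr].next = next;
    fake_kmem[addr].start = start;
    fake_kmem[addr].end = end;
    fake_kmem[addr].object = object;
}

/* Stand-in for KREAD(): returns non-zero on failure, like the original. */
static int fake_kread(unsigned long addr, struct fake_entry *out)
{
    assert(out != NULL);
    if (addr >= FAKE_KMEM_ENTRIES)
        return 1;
    *out = fake_kmem[addr];
    return 0;
}

/* Returns 1 and fills *out if an entry covering va is found, else 0. */
static int find_entry(unsigned long head, unsigned long va,
                      struct fake_entry *out)
{
    unsigned long addr = head;
    int visited;

    for (visited = 0; visited < MAX_WALK; visited++) {
        if (addr == 0)          /* null link: the list is inconsistent */
            return 0;
        if (fake_kread(addr, out))
            return 0;
        if (va >= out->start && va < out->end && out->object != 0)
            return 1;
        addr = out->next;
        if (addr == head)       /* walked the whole circular list */
            return 0;
    }
    return 0;                   /* a cycle that never returns to head */
}
```

With a check like the 'addr == 0' one, the state shown in the gdb session
above would make the walk give up instead of rereading offset 0 forever.  The
visit cap is only a backstop for a stale list that loops without ever coming
back to 'head'.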

Thanks in advance for any assistance anyone can offer.

Duncan