Subject: Re: "panic: chgproccnt: lost user" on dual-CPU SS20 with 1.6.1_STABLE
To: David Laight <david@l8s.co.uk>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 07/27/2003 13:36:20
[ On Sunday, July 27, 2003 at 07:47:57 (+0100), David Laight wrote: ]
> Subject: Re: "panic: chgproccnt: lost user" on dual-CPU SS20 with 1.6.1_STABLE
>
> The trace message your kernel has output indicated that you have
> changed the ruid of a process.

Yes, that's what I've done, in sys_seteuid(), though I think I've done
it in exactly the same way as sys_setuid() does it:

Index: sys/kern/kern_prot.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/kern_prot.c,v
retrieving revision 1.68
diff -c -r1.68 kern_prot.c
*** sys/kern/kern_prot.c	6 Dec 2001 23:11:59 -0000	1.68
--- sys/kern/kern_prot.c	24 Jun 2003 22:50:35 -0000
***************
*** 51,56 ****
--- 51,57 ----
  
  #include <sys/param.h>
  #include <sys/acct.h>
+ #include <sys/syslog.h>
  #include <sys/systm.h>
  #include <sys/ucred.h>
  #include <sys/proc.h>
***************
*** 405,410 ****
--- 406,438 ----
  	 * not see our changes.
  	 */
  	pc->pc_ucred = crcopy(pc->pc_ucred);
+ 	if (pc->pc_ucred->cr_uid == 0) {
+ 		/*
+ 		 * XXX Does this "raise the bar" enough if we only do this for
+ 		 * set-ID-0 programs?  Maybe we should just forget about this
+ 		 * silly saved-set-user-ID stuff and stop supporting it
+ 		 * completely!
+ 		 */
+ 		log(LOG_DEBUG, "%s: pid %d [eid %d:%d, rid %d:%d, svid: %d:%d] called seteuid(%d) as superuser, setting svid and ruid to %d\n",
+ 		    p->p_comm,
+ 		    p->p_pid,
+ 		    pc->pc_ucred->cr_uid,	/* effective ID */
+ 		    pc->pc_ucred->cr_gid,
+ 		    pc->p_ruid,			/* Real ID */
+ 		    pc->p_rgid,
+ 		    pc->p_svuid,		/* saved set-user-ID */
+ 		    pc->p_svgid,
+ 		    euid, euid);
+ 		/*
+ 		 * If running as root then behave almost as if setuid() were
+ 		 * called and prevent any future re-instatement of privilege.
+ 		 */
+ 		(void) chgproccnt(pc->p_ruid, -1);
+ 		(void) chgproccnt(euid, 1);
+ 		pc->pc_ucred = crcopy(pc->pc_ucred);
+ 		pc->p_svuid = euid;
+ 		pc->p_ruid = euid;
+ 	}
  	pc->pc_ucred->cr_uid = euid;
  	p_sugid(p);
  	return (0);


>  chgproccnt() is used to count the
> number of processes with each ruid (in order to enforce rlimit.nproc).

Yes, I do understand this part.
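
For readers following along, the per-ruid bookkeeping described above can be modelled in a few lines of userland C. This is only a sketch, not the kernel's code: model_chgproccnt(), the fixed-size table, MAX_UIDS and the LOST_USER return value are all hypothetical stand-ins. It assumes, as the panic message suggests, that the "lost user" condition is a decrement aimed at a uid that has no count entry.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UIDS  16	/* hypothetical table size, for this sketch only */
#define LOST_USER (-1)	/* stands in for panic("chgproccnt: lost user") */

struct uid_count {
	long uid;	/* real user ID this slot counts */
	int  nproc;	/* processes counted under uid; 0 means slot is free */
};

static struct uid_count table[MAX_UIDS];

/*
 * Adjust the process count for uid by diff and return the new count.
 * Returns LOST_USER when asked to decrement a uid that has no entry.
 */
int
model_chgproccnt(long uid, int diff)
{
	struct uid_count *slot = NULL, *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_UIDS; i++) {
		if (table[i].nproc > 0 && table[i].uid == uid) {
			slot = &table[i];
			break;
		}
		if (table[i].nproc == 0 && free_slot == NULL)
			free_slot = &table[i];
	}

	if (slot != NULL) {
		slot->nproc += diff;
		assert(slot->nproc >= 0);	/* a negative count is a bug */
		return slot->nproc;		/* slot is free again at zero */
	}

	if (diff <= 0)
		return diff == 0 ? 0 : LOST_USER;

	assert(free_slot != NULL);		/* sketch: table never fills */
	free_slot->uid = uid;
	free_slot->nproc = diff;
	return diff;
}
```

In this model, a process counted under uid 0 whose ruid is later changed to 1000 without the count being moved would hit LOST_USER when its exit decrements uid 1000. Any path that updates p_ruid and the counts out of step would produce the same result.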

Since I first posted I've learned that I can't trigger the crash on
demand just by running "calendar -a" from the command line, though of
course the log message from my change above always appears, since
calendar calls seteuid(2).  (FYI calendar happens to be the last base
program I'm fixing as part of my complete set of changes to seteuid(),
setreuid(), etc.)

So what I still don't understand is the intermittent (i.e. not so easily
reproducible) nature of the problem.

Unfortunately, since I last posted I've had some trouble getting a kernel
that includes DDB to boot.  I probably shouldn't have, but I added DEBUG,
DIAGNOSTIC, and '-g' at the same time I added DDB -- I've been taking
them out one by one, but so far the only result is:


    Rebooting with command:                                               
    Boot device: /iommu/sbus/espdma@f,400000/esp@f,800000/sd@1,0  File and args: netbsd
    >> NetBSD/sparc Secondary Boot, Revision 1.12
    >> (woods@proven, Tue May 27 17:35:26 EDT 2003)
    Booting netbsd
    2974912+109204+284944 [189600+144611Instruction Access Exception
    Type  help  for more information
    <#0> ok 

One more rebuild to go before I need some new ideas again....

The only other thing that differs about these failing kernels is that
I've been building them on the host itself, whereas the original install
kernel, which does boot but sometimes crashes, was cross-built from my
i386 server.


BTW, thank you for at least showing your assumptions this time.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>