Source-Changes archive
CVS commit: src/sys/rump/librump/rumpkern
Module Name: src
Committed By: riastradh
Date: Sun Apr 6 01:13:55 UTC 2025
Modified Files:
src/sys/rump/librump/rumpkern: lwproc.c
Log Message:
rump: Nix leaked struct lwp on every rump_server syscall.
This leak was introduced by lwproc.c rev. 1.51 back in 2020, when the
kmem_zalloc(sizeof(*l), KM_SLEEP) was factored out of callers and
into lwproc_makelwp:
-static void
-lwproc_makelwp(struct proc *p, struct lwp *l, bool doswitch, bool procmake)
+static struct lwp *
+lwproc_makelwp(struct proc *p, bool doswitch, bool procmake)
{
+ struct lwp *l = kmem_zalloc(sizeof(*l), KM_SLEEP);
...
@@ -435,8 +431,12 @@ rump_lwproc_newlwp(pid_t pid)
kmem_free(l, sizeof(*l));
return EBUSY;
}
+ mutex_exit(p->p_lock);
mutex_exit(&proc_lock);
- lwproc_makelwp(p, l, true, false);
+
+ /* XXX what holds proc? */
+
+ lwproc_makelwp(p, true, false);
Unfortunately, the kmem_zalloc in rump_lwproc_newlwp was not deleted.
So it just leaked.
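Concretely, the post-rev. 1.51 function had roughly this shape -- a
hedged sketch, not the literal rev. 1.58 source; the proc lookup and
locking are elided:

int
rump_lwproc_newlwp(pid_t pid)
{
	/* leftover allocation -- nothing consumes this l any more */
	struct lwp *l = kmem_zalloc(sizeof(*l), KM_SLEEP);
	struct proc *p;

	/* ... look up and lock p; the error paths still kmem_free(l) ... */

	/* lwproc_makelwp allocates its own struct lwp internally */
	lwproc_makelwp(p, true, false);

	/* success path: l is never freed, so one struct lwp leaks per call */
	return 0;
}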
And this routine is called on _every_ syscall handled by rump_server
(via struct rumpuser_hyperup::hyp_lwproc_newlwp):
674 static void
675 serv_handlesyscall(struct spclient *spc, struct rsp_hdr *rhdr, uint8_t *data)
676 {
677 register_t retval[2] = {0, 0};
678 int rv, sysnum;
679
680 sysnum = (int)rhdr->rsp_sysnum;
681 DPRINTF(("rump_sp: handling syscall %d from client %d\n",
682 sysnum, spc->spc_pid));
683
=> 684 if (__predict_false((rv = lwproc_newlwp(spc->spc_pid)) != 0)) {
685 retval[0] = -1;
686 send_syscall_resp(spc, rhdr->rsp_reqno, rv, retval);
687 return;
688 }
689 spc->spc_syscallreq = rhdr->rsp_reqno;
690 rv = rumpsyscall(sysnum, data, retval);
https://nxr.netbsd.org/xref/src/lib/librumpuser/rumpuser_sp.c?r=1.77#684
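The wiring between the two sides goes through the hypercall upcall
table; a hedged sketch (the hyp_lwproc_newlwp member name is taken
from the log above, but the initializer shape and the direct
assignment of rump_lwproc_newlwp are assumptions about the
surrounding code):

/* rump kernel side: the upcall table handed to rumpuser */
static const struct rumpuser_hyperup hyp = {
	/* ... other upcalls elided ... */
	.hyp_lwproc_newlwp = rump_lwproc_newlwp,	/* the leaky routine */
};

/* librumpuser side: the lwproc_newlwp() called in serv_handlesyscall
 * above is a thin wrapper around hyp.hyp_lwproc_newlwp(pid), so every
 * handled syscall leaks one struct lwp. */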
So this leak would grow fairly quickly in processes issuing rump
syscalls to rump_servers, which t_sp:sigsafe's helper h_sigcli does
as fast as it can in a loop for 5 sec.  That is just long enough for
the i386 releng testbed, with about 128 MB of RAM, to run out of
memory in roughly half of the test runs, but not long enough on my
laptop with 64 GB of RAM to ever reproduce the problem.
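For scale, a back-of-envelope estimate -- the struct lwp size and the
syscall rate here are assumptions for illustration, not measurements:

	sizeof(struct lwp)      ~ 1 KB        (assumed)
	h_sigcli syscall rate   ~ 25k/sec     (assumed)
	leak rate               ~ 25 MB/sec
	leaked over 5 sec       ~ 125 MB      -- enough to exhaust a
	                                         128 MB testbed, but
	                                         noise next to 64 GB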
Found by:
(a) running the test for 20sec rather than 5sec to amplify the leak;
(b) counting bytes allocated by return addresses of rumpuser_malloc;
(c) chasing that to rump_hypermalloc (culprit) vs uvm_km_alloc;
(d) chasing that to pgctor, uvm_map, uvm_km_kmem_alloc (culprit),
vmapbuf;
(e) then to pool_page_alloc, then to pool_allocator_alloc, then to
pool_grow, then to pool_get, then to pool_cache_get, then to
kmem_zalloc;
(f) finally to the call to kmem_zalloc in rump_lwproc_newlwp, which
at last yielded to my stare.
This instrumentation was extremely ad hoc -- I just created a table
of 4096 entries for each routine and populated it by linear scan with
atomics:
struct {
	void *volatile ra;		/* caller's return address */
	volatile unsigned long n;	/* bytes allocated from that ra */
} kmem_zalloc_bytes[4096];

void *
kmem_zalloc(size_t size, km_flag_t flags)
{
	...
	void *const ra = __builtin_return_address(0);
	size_t i;

	/* find or claim this return address's slot by linear scan */
	for (i = 0; i < __arraycount(kmem_zalloc_bytes); i++) {
		if (kmem_zalloc_bytes[i].ra == ra ||
		    (kmem_zalloc_bytes[i].ra == NULL &&
			atomic_cas_ptr(&kmem_zalloc_bytes[i].ra, NULL,
			    ra) == NULL)) {
			atomic_add_long(&kmem_zalloc_bytes[i].n, size);
			break;
		}
	}
	...
}
Would be nice to systematize this. Would also be nice to bring back
malloc tags for accounting purposes so you don't need to match up the
kmem_zalloc return addresses with the kmem_free return addresses to
find leaks -- I got lucky here because there were very few return
addresses to piece through.
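For reference, the pairing half would be a mirror of the table above
inside kmem_free -- a hedged sketch of what one would add, not code
that was actually run here:

struct {
	void *volatile ra;		/* caller's return address */
	volatile unsigned long n;	/* bytes freed from that ra */
} kmem_free_bytes[4096];

void
kmem_free(void *p, size_t size)
{
	...
	void *const ra = __builtin_return_address(0);
	size_t i;

	for (i = 0; i < __arraycount(kmem_free_bytes); i++) {
		if (kmem_free_bytes[i].ra == ra ||
		    (kmem_free_bytes[i].ra == NULL &&
			atomic_cas_ptr(&kmem_free_bytes[i].ra, NULL,
			    ra) == NULL)) {
			atomic_add_long(&kmem_free_bytes[i].n, size);
			break;
		}
	}
	...
}

A leak then shows up as an allocating return address whose total keeps
growing with no free-side total of comparable size -- but the matching
is still manual, since the free site's return address is unrelated to
the allocation site's, which is what malloc tags would have recorded.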
PR misc/59252: tests/rump/rumpkern/t_sp:sigsafe: out of memory
To generate a diff of this commit:
cvs rdiff -u -r1.58 -r1.59 src/sys/rump/librump/rumpkern/lwproc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.