NetBSD-Bugs archive


port-amd64/52679: amd64 pmap page leak?

>Number:         52679
>Category:       port-amd64
>Synopsis:       amd64 pmap page leak?
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-amd64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 31 08:30:00 +0000 2017
>Originator:     David A. Holland
>Release:        NetBSD 8.99.1 (20170809)
System: NetBSD macaran 8.99.1 NetBSD 8.99.1 (MACARAN) #42: Wed Aug 9 22:31:11 EDT 2017 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Architecture: x86_64
Machine: amd64

>Description:
Today one of my machines deadlocked due to what turned out to be
garden-variety kva exhaustion: the X server went into D state with
wchan "vmem", and the backtrace from crash(8) showed pool_grow and
...

The proximate cause was a 5GB browser process, but in the course of
investigating it, it looked very much like a lot of system memory had
gone missing.

The first few lines of vmstat -s output:
     4096 bytes per page
        8 page colors
  1523017 pages managed
     5290 pages free
   500583 pages active
   248837 pages inactive
        0 pages paging
      763 pages wired
     4365 zero pages
        1 reserve pagedaemon pages
       20 reserve kernel pages
   429542 anonymous pages
   274459 cached file pages
    46182 cached executable pages
     1024 minimum free pages
     1365 target free pages
   507672 maximum wired pages
        1 swap devices
  1587221 swap pages
  1137567 swap pages in use
  1593324 swap allocations

Since free + inactive + active + wired + zero should add up to roughly
the managed total, but here they sum to only about 760k of the 1523017
managed pages, it looks like half the system's memory has disappeared
somewhere.
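The arithmetic can be checked directly; a standalone sketch using the
numbers from the vmstat -s output above:

```python
# Sanity-check the page counts reported by vmstat -s above.
PAGE_SIZE = 4096        # "4096 bytes per page"
MANAGED = 1523017       # "pages managed"

counted = {
    "free": 5290,
    "active": 500583,
    "inactive": 248837,
    "wired": 763,
    "zero": 4365,
}

def gb(pages):
    """Convert a page count to gigabytes."""
    return pages * PAGE_SIZE / 2**30

total = sum(counted.values())
missing = MANAGED - total

print(f"accounted for: {total} pages ({gb(total):.1f} GB)")
print(f"missing:       {missing} pages ({gb(missing):.1f} GB)")
```

which shows roughly 2.9 GB accounted for and 2.9 GB unaccounted for,
consistent with "half the memory" on a 6G machine.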

vmstat -m said the kernel was using roughly 1.5G, but even if that
isn't counted above, there's still about 1.5G missing. Is there some
other category, not displayed, that managed pages can be in?

(The machine has 6G of ram and 6G of swap, and it ought to be able to
handle a 5G browser process without going 4G into swap, since there
wasn't anything else large running. For a while a few days ago I was
running a second not-small browser process as well, but it was shut
down ~36 hours before the events today.)

It's odd that this should have so many approximate halves in it (6G
total -> 3G reported above -> 1.5G used by the kernel) but maybe
that's just the condition required for it to splode.


>How-To-Repeat:
Thrash memory on and off for several days, I guess...


>Fix:
Dunno. Confirmation that this does actually reflect a problem would be
a helpful first step.

It would also be useful if vmstat -s output came as groups of page
counts that were specifically supposed to add up, to make these
diagnoses easier.

I'm filing this in port-amd64 because it's presumptively a pmap-level
issue until proven otherwise... unless it's a false alarm and
something else entirely was going on.
