tech-kern archive


zfs not freeing znodes



I have been having problems with my netbsd-10 systems locking up.  This
has happened on at least 3 physical computers and 3 physical disks, so I'm
as sure as I can be that it's not hardware.

Systems are all

  - netbsd-10 amd64
  - / and /usr on ffs, bulk data on zfs

and have had 4, 8, 24, or 32 G of RAM, which from the zfs viewpoint ranges
from marginal to generally considered enough.

The symptoms are that everything is fine for a time, and then the system
ends up locked up: the keyboard can't switch out of X, get into ddb, or do
anything else.  Sometimes I catch it as it is deteriorating and have been
able to get into ddb, and there are a bunch of processes in tstile, with
underlying locks including flt_noram5 (from fallible memory).

My guess is that I run low on memory and there is a locking bug (failure
to release) in a rarely-taken path, perhaps when trying to delete files in
zfs while the system is out of RAM, or that sort of thing.

Things that tend to lead to higher odds of lockup are:

  - daily cronjob running (8 GB machine w/o X)
  - leaving firefox open, especially with piggy js tabs
  - running pkgsrc builds
  - anything that deals with very large numbers of files.

I have adjusted zfs's target allocations to use less RAM, basing them on
total RAM.  In theory these should be sysctls anyway.

One thing I figured out before is that zfs's approach to respecting RAM
limits is to go ahead and allocate when requested and to have a background
thread free things.  This can result in going way over, and I think it
makes something like "untar this huge bunch of files into zfs" put memory
pressure on the rest of the system.
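
As a toy illustration of why that can overshoot (this is not ZFS code, just
a made-up model of the allocate-now, reclaim-in-the-background pattern, in
python with arbitrary numbers):

  TARGET = 1000        # target number of live buffers (think "arc target")
  ALLOC_PER_TICK = 80  # workload allocates this many per tick, never sleeps
  REAP_PER_TICK = 50   # background thread frees at most this many per tick

  live = 0
  peak = 0
  for tick in range(200):
      live += ALLOC_PER_TICK        # allocation always succeeds immediately
      if live > TARGET:             # the reaper only trims the excess,
          live -= min(REAP_PER_TICK, live - TARGET)   # and only this fast
      peak = max(peak, live)

  print("target", TARGET, "peak", peak)   # peak ends up far above the target

As long as the workload allocates faster than the background thread frees,
usage keeps climbing past the target for the whole burst, which looks a lot
like the untar case squeezing the rest of the system.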

On a 32 GB machine, the lockups got more frequent (I can't rule out a
graphics card failure unrelated to the memory problem), so I started
looking harder. I ran vmstat -m before and after doing cvs updates in
NetBSD checkouts (I have them for 9, 10 and current), and pkgsrc.   I
noticed that "dnode_t" showed a large number of requests and pages and
*no releases*.

An example is
 - 1847937 requests
 - 307990 pages

That's 1203 MB in dnodes (which are 632 bytes each).  But the concerning
thing is that every time I did an update of a tree (different ones) the
dnode allocation rose and I saw no frees.
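
Working that out (assuming the usual 4096-byte pool pages, which hold 6
632-byte dnodes each):

  1847937 requests / 6 per page ~= 307990 pages
  307990 pages * 4096 bytes     ~= 1203 MB

which matches the page count above, so essentially every dnode ever
requested is still sitting in the pool.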

I then remembered that I had bumped up kern.maxvnodes long ago, before I
was even using zfs, because a netbsd or pkgsrc tree was not fitting in
the cache.

maxvnodes was at about 1.6M.  This seemed big, and I set it to 500K.
After that, additional "call stat on this huge bunch of files" operations
did not result in new allocations, but I didn't see any frees either.  This
was all yesterday or Tuesday.
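
For reference, that's just the usual knob, either at run time or in
/etc/sysctl.conf so it applies at boot:

  sysctl -w kern.maxvnodes=500000
  kern.maxvnodes=500000              # in /etc/sysctl.conf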

This morning there are 1592742 releases, but no pages have been
released.

I went to see what I was setting maxvnodes to, and it seems I removed
that setting long ago, probably when I upgraded my main machine from an
8G box to a 24G box, or earlier.


This all leaves a lot of questions:

  Obviously I need to read the zfs code to see what dnode is being used
  for (I'm guessing it's "disk node", the on-disk info backing a vnode) and
  how the number of allocations is controlled and how they are freed.

  1.6M vnodes on a system with 32G of RAM seems like it should be ok.  I
  plan to set it to 500K on boot and see if that avoids lockups.

  zfs's strategy of freeing in the background rather than making over-limit
  processes sleep seems kind of risky.  I can see an "if mildly over, let
  the background free deal with it" approach, but if a process is just
  allocating as fast as it can, it seems able to run the system out of RAM.

  There remains the question of what happens when there isn't RAM left
  to allocate from pools.  I think it's highly likely there is a bug.


Things on my todo list to debug:

  Set up a VM that has ZFS and see if I can make a repro recipe.

  In the VM, also try DIAGNOSTIC/DEBUG/LOCKDEBUG.

  Write code to capture vmstat output periodically and save it (a rough
  sketch is below).  I expect graphing this to be useful in understanding
  what is going on.
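
  A minimal sketch of that capture loop (plain python; it assumes nothing
  beyond vmstat(1) being in the path, and the interval and log location
  are placeholders):

    #!/usr/bin/env python3
    # Append a timestamped "vmstat -m" snapshot to a log file every
    # INTERVAL seconds, so pool growth (e.g. dnode_t) can be graphed later.

    import subprocess
    import time

    INTERVAL = 300                       # seconds between snapshots
    LOGFILE = "/var/tmp/vmstat-m.log"    # arbitrary location

    while True:
        out = subprocess.run(["vmstat", "-m"],
                             capture_output=True, text=True).stdout
        with open(LOGFILE, "a") as f:
            f.write("=== %s ===\n" % time.strftime("%Y-%m-%dT%H:%M:%S"))
            f.write(out)
        time.sleep(INTERVAL)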
   

What I don't understand is why others aren't seeing this.  I do have
settings to avoid having the file cache page out all my processes:

  # \todo Reconsider and document
  vm.filemin=5
  vm.filemax=10
  vm.anonmin=5
  vm.anonmax=80
  vm.execmin=5
  vm.execmax=50
  vm.bufcache=5

But this is, I believe, pretty normal among NetBSD users.

However, on the other logical machine, which is a Xen dom0, I have a
stock sysctl.conf, and it would reliably crash on the daily cron with 4
GB of RAM, but stay up when running GENERIC with the full 8 GB.

