Current-Users archive


Re: random lockups (now suspecting zfs)

Simon Burge <> writes:

> Greg Troxel wrote:
>> On Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
>>> A different machine has locked up, running recent netbsd-10.  I was
>>> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
>>> physical.  It has a private patch to reduce the amount of memory used
>>> for ARC, which has been working well.
> Are you still seeing the problem below even with limiting the amount of
> memory ARC can use?

Yes.  I have been running with a limited ARC for a long time, since I
posted my patch.  I find that just doing lots of zfs activity, enough
that the unpatched sizing would have over-used RAM for ARC, is OK.  On
my 32G system, my boot messages are

  ARCI 002 arc_abs_min 16777216
  ARCI 002 arc_c_min 1067485440
  ARCI 005 arc_c_max 4269941760
  ARCI 010 arc_c_min 1067485440
  ARCI 010 arc_p     2134970880
  ARCI 010 arc_c     4269941760
  ARCI 010 arc_c_max 4269941760
  ARCI 011 arc_meta_limit 1067485440
  ZFS filesystem version: 5

or about 4G for ARC.  On my 8G physical/4G dom0 system:

  ARCI 002 arc_abs_min 16777216
  ARCI 002 arc_c_min 131072000
  ARCI 005 arc_c_max 524288000
  ARCI 010 arc_c_min 131072000
  ARCI 010 arc_p     262144000
  ARCI 010 arc_c     524288000
  ARCI 010 arc_c_max 524288000
  ARCI 011 arc_meta_limit 131072000
  ZFS filesystem version: 5

it's 524 MB.  I think it would be good to commit something like my
patch, but people have said that large-memory systems shouldn't see a
change.  I think that's wrong; as I see it, NetBSD's code oversizes the
ARC compared to upstream for no good reason.  But the fix for that is to
make it settable, and then the default isn't so important.

>> >> All 3 tmux windows show something like
>> >> 
>> >>   [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 6396k
>> >> 
>> >> and I can switch among them and ^T, but trying to run top is stuck (in
>> >> flt_noram5).  I'll give it an hour or so, and have a look at the
>> >> console.
> I've seen cc1plus processes wedged in either flt_noram or tstile after
> doing multiple builds, and a reboot is the only way out.  I'm using ZFS
> for everything except swap and some mostly-unused media files that live
> on an FFS.

Perhaps I failed to say that the box sometimes fails to respond to ping
when it gets like this.

>> So to me this feels like a locking botch in a rare path in zfs.
> This appears to be the case.  Chuck Silvers has some understanding of
> the problem and I'm helping test, but at this stage there isn't a fix
> available. :/

It's great to hear that someone has an idea.

So far I can't reproduce this on demand, but it does seem that running a
pkg_rr in the dom0 and in a domU at the same time tends to provoke it.
The domU has two virtual disks, one for files and one for swap.  Both
disks' backing files are zvols.

The files disk is UFS2 (no zfs in the domU).  pkgsrc, distfiles,
packages, and tmp for piggy programs are all NFS from the dom0; the dom0
is UFS2 for / and /usr, with pkgsrc, packages, distfiles, and
tmp-for-piggy all on zfs.
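For concreteness, the domU disk stanza along those lines would look
something like this in the xl guest config (pool and volume names are
made up here, and the exact vdev names may differ):

```
# hypothetical xl guest config fragment; pool/volume names invented
disk = [
	'phy:/dev/zvol/dsk/tank/domu-files,xvda,w',
	'phy:/dev/zvol/dsk/tank/domu-swap,xvdb,w',
]
```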

I suppose it might help for me to build a kernel with LOCKDEBUG and then
try the builds.  That will surely be slow, but is it likely to be
illuminating enough that it makes sense to try?
