Re: kern/58317: hang in vcache_vget()

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,manu%netbsd.org@localhost
Subject: Re: kern/58317: hang in vcache_vget()
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Thu, 6 Jun 2024 17:55:02 +0000 (UTC)

The following reply was made to PR kern/58317; it has been noted by GNATS.

From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/58317: hang in vcache_vget()
Date: Fri, 07 Jun 2024 00:53:51 +0700

 I have seen that kind of thing as well - my pet suspect (a pure guess
 at the cause) was USB drive I/O, but it probably isn't related.

 One thing you should test is time - ie: just wait.   When I'm
 patient, things usually eventually recover.   That is, it isn't
 really hung, just waiting for something which takes a long time
 to complete.   And I do mean a long time, not "long by computer
 standards" (but not astronomically, or even archaeologically, long either).
 I mean tens of minutes, perhaps even an hour. 

 Things which are time sensitive (like keeping transfers running
 without the remote end deciding you've vanished as no more data
 is being taken - the TCP window sits at 0 for too long) tend to fail,
 but the system itself, in my experience, tends to recover.

 That is, even though processes report being stuck in tstile
 waits, and in the past that often represented a deadlock
 somewhere, I have not seen one of those in a long time - these are,
 or seem to be, just waits that last much longer than we'd want
 them to.

 And to answer (from my experience) Greg's question - yes, in my
 cases this tends to happen when there's memory pressure, but not
 the kind of pressure that should be bothering anything (in over 2
 years I've yet to see my system page anything out, ever - I have
 swap space, plenty, but it always shows 0 used) - there is
 something broken in some of the ubc/buffer cleanup code somewhere I think.
 Much of the used memory should simply have been flushed ages ago.

 My system has 64GB, which is quite a lot, but even if all of that
 (every single page) was data waiting to be written, which it
 obviously isn't, the slowest of my drives can handle 60MB/sec most
 of the time (and others much more) so all 64GB could be written in not
 much more than 1024 seconds, wherever it was destined, even if all to
 one of the slower drives, which is unlikely, but that is less than some
 of the hangs I have seen (30 mins or more).
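
 As a rough check of that arithmetic, here is a minimal sketch.  It assumes
 binary units (1 GB = 1024 MB) and a sustained 60MB/sec with no seek or
 filesystem overhead; the function name is made up for illustration.

```python
def seconds_to_flush(mem_gb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to write mem_gb (1 GB = 1024 MB) at rate_mb_per_s."""
    return mem_gb * 1024 / rate_mb_per_s


if __name__ == "__main__":
    secs = seconds_to_flush(64, 60)
    # Prints: 1092 seconds, about 18.2 minutes
    print(f"{secs:.0f} seconds, about {secs / 60:.1f} minutes")
```

 So even this worst case comes to roughly 18 minutes, well under a 30 minute
 hang.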

 Further, even in that case, as parts of that data are written, they should
 be discarded, freeing memory, which should allow progress elsewhere
 well before it is all done.  But apparently not.

 My guess with that is that something is being locked, and stays locked,
 while all of this gets cleaned up - and that lock prevents almost all
 useful progress elsewhere from continuing.   It is worth noting that
 processes not doing actual I/O (like clocks, but also things like vmstat
 and iostat in xterms) keep on working.   So it looks like a tstile/deadlock
 issue, but isn't.

 I haven't yet managed to find which kernel thread is causing the problems.
 And that assumes there is just one of them, acting alone.

 Oh, and once it recovers, it is recovered (until next time) - it isn't as
 if lots of memory is being lost somewhere (or not enough that I can
 detect it anyway).

 Also, as it might be related, a while ago now, the system had a sudden stop
 due to power failure (either there was no UPS at the time, or the UPS
 gave up, I forget ... I'm not currently running any UPS monitoring system,
 which I know I should be, but that's not the issue here).   When the
 power returned, and the system rebooted, everything looked fine.   That is,
 except that data in some files was missing.   Now that's expected, data is
 buffered, and while we make sure (one way or another) that the meta-data
 (directories, inodes, etc) are all consistent, there's no guarantee that
 file data will have made it to disc.   What was surprising here is that
 when this happened, nothing much had been happening (it was an idle
 day for me as far as computer work was concerned).   Some of the files
 that ended up with no valid data were e-mail messages I had fetched & read
 more than 12 hours earlier (that is, before the power loss).   I use nmh,
 which uses one file for each message, and (as the meta data was all saved)
 the mod times of those files were there (and correct, or at least, close to
 when I'd expected they would be, I hadn't been making notes!)   But the
 message contents (the block contents) were just binary trash - whatever
 had been in the blocks when they were last used for something else, no
 signs of the e-mail message contents.   (I lost nothing, I just had to
 fetch and process that e-mail again, after deleting all the broken files).
 Most probably there was other damaged data as well, but that would have
 been harder to detect, and was probably unimportant.

 NetBSD "lost" update(8) some time ago - but since the above I have been running
 a sh script that does sync(2) (via sync(8)) every 30 secs (randomised a bit).
 No more issues like that (though sync is one of the processes
 that will sleep for lengthy periods when hangs happen).   The syncs don't stop
 the hangs however (though they might sometimes seem to trigger a short one,
 if there has been lots of I/O happening recently - lots of file copying).
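
 For illustration only, a periodic sync loop of that kind might look roughly
 like the sketch below.  This is not the actual sh script: it is written in
 Python, calls os.sync() rather than running sync(8), and the 5 second jitter
 and the function names are assumptions.

```python
import os
import random
import time


def next_interval(base=30.0, jitter=5.0):
    """Return the base interval randomised by up to +/- jitter seconds."""
    return base + random.uniform(-jitter, jitter)


def sync_loop(iterations=None, base=30.0, jitter=5.0,
              do_sync=os.sync, sleep=time.sleep):
    """Call do_sync, then sleep a jittered interval.

    Loops forever when iterations is None; returns the number of syncs done.
    """
    done = 0
    while iterations is None or done < iterations:
        do_sync()  # may itself block for a long time during a hang
        sleep(next_interval(base, jitter))
        done += 1
    return done


if __name__ == "__main__":
    sync_loop()
```

 The do_sync and sleep parameters exist only so the loop can be exercised
 without touching the disks or waiting in real time.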

 kre
