NetBSD-Bugs archive
Re: kern/58317: hang in vcache_vget()
The following reply was made to PR kern/58317; it has been noted by GNATS.
From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: kern/58317: hang in vcache_vget()
Date: Fri, 07 Jun 2024 00:53:51 +0700
I have seen that kind of thing as well - my pet (purely guessed)
victim (imagined cause) was USB drive I/O, but it probably isn't related.
One thing you should test is time - ie: just wait. When I'm
patient, things usually eventually recover. That is, it isn't
really hung, just waiting for something which takes a long time
to complete. And I do mean a long time, not "long by computer
standards" (but not astronomic, or even archaeological long either).
I mean tens of minutes, perhaps even an hour.
Things which are time sensitive (like keeping transfers running
without the remote end deciding you've vanished as no more data
is being taken - the TCP window sits at 0 for too long) tend to fail,
but the system itself in my experience, tends to recover.
That is, even though processes report being stuck in tstile
waits, and in the past that often represented a deadlock
somewhere, I have not seen a real deadlock in a long time - these are,
or seem to be, just waits that last much longer than we'd want
them to.
And to answer (from my experience) Greg's question - yes, in my
cases this tends to happen when there's memory pressure, but not
the kind of pressure that should be bothering anything (in over 2
years I've yet to see my system page anything out, ever - I have
swap space, plenty, but it always shows 0 used) - there is
something broken in some of the ubc/buffer cleanup code somewhere I think.
Much of the used memory should simply have been flushed ages ago.
My system has 64GB, which is quite a lot, but even if all of that
(every single page) was data waiting to be written, which it
obviously isn't, the slowest of my drives can handle 60MB/sec most
of the time (and others much more) so all 64GB could be written in not
much more than 1024 seconds, wherever it was destined, even if all to
one of the slower drives, which is unlikely, but that is less than some
of the hangs I have seen (30 mins or more).
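The back-of-envelope estimate above can be checked with a short calculation. This is only a sketch: it treats "64GB" as 64 GiB and "60MB/sec" as 60 MiB/s, assumes the rate is sustained for the whole write, and the function name is made up for illustration.

```python
def flush_time_seconds(total_gib: float, rate_mib_per_s: float) -> float:
    """Time to write total_gib of data at a sustained rate_mib_per_s."""
    total_mib = total_gib * 1024  # GiB -> MiB
    return total_mib / rate_mib_per_s


# Worst case from the message: all 64GB dirty, all going to the slowest drive.
seconds = flush_time_seconds(64, 60)
minutes = seconds / 60
print(f"{seconds:.0f} s, about {minutes:.1f} min")
```

Under those assumptions the worst case comes out well under the 30 minutes or more that some of the hangs last.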
Further, even in that case, as some of that is written, it should
be being discarded, leading to available memory, which should be
allowing progress elsewhere, well before it is done. But apparently not.
My guess with that is that something is being locked, and stays locked,
while all of this gets cleaned up - and that lock prevents almost all
useful progress elsewhere from continuing. It is worth noting that
processes not doing actual I/O (like clocks, but also things like vmstat
and iostat in xterms) keep on working. So it looks like a tstile/deadlock
issue, but isn't.
I haven't yet managed to find which kernel thread is causing the problems.
And that assumes there is just one of them, acting alone.
Oh, and once it recovers, it is recovered (until next time) - it isn't as
if lots of memory is being lost somewhere (or not enough that I can
detect it anyway).
Also, as it might be related, a while ago now, the system had a sudden stop
due to power failure (either there was no UPS at the time, or the UPS
gave up, I forget ... I'm not currently running any UPS monitoring system,
which I know I should be, but that's not the issue here). When the
power returned, and the system rebooted, everything looked fine. That is,
except that data in some files was missing. Now that's expected: data is
buffered, and while we make sure (one way or another) that the meta-data
(directories, inodes, etc) are all consistent, there's no guarantee that
file data will have made it to disc. What was surprising here is that
when this happened, nothing much had been happening (it was an idle
day for me as far as computer work was concerned). Some of the files
that ended up with no valid data were e-mail messages I had fetched & read
more than 12 hours earlier (that is, before the power loss). I use nmh,
which uses one file for each message, and (as the meta data was all saved)
the mod times of those files were there (and correct, or at least, close to
when I'd expected they would be, I hadn't been making notes!) But the
message contents (the block contents) were just binary trash - whatever
had been in the blocks when they were last used for something else, no
signs of the e-mail message contents. (I lost nothing, I just had to
fetch and process that e-mail again, after deleting all the broken files).
Most probably there was other damaged data as well, but that would have
been harder to detect, and was probably unimportant.
NetBSD "lost" update(8) some time ago - but since the above I have been running
a sh script that does sync(2) (via sync(8)) every 30 secs (randomised a bit).
No more issues like that (though sync is one of the processes
that will sleep for lengthy periods when hangs happen). The syncs don't stop
the hangs however (though they might sometimes seem to trigger a short one,
if there has been lots of I/O happening recently - lots of file copying).
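The script itself is not shown in the message, and the original is a sh script. As a rough illustration of the same idea, here is a Python sketch of a periodic sync with a randomised interval. The 5 second jitter, the function names and the injectable sleep/sync parameters are all assumptions made for the example, not details from the message.

```python
import os
import random
import time


def jittered_interval(base: float = 30.0, jitter: float = 5.0) -> float:
    """Return the base interval moved by up to +/- jitter seconds."""
    return base + random.uniform(-jitter, jitter)


def sync_loop(base=30.0, jitter=5.0, iterations=None,
              sleep=time.sleep, sync=None):
    """Call sync(), sleep a randomised interval, repeat.

    iterations=None loops forever; sleep and sync can be replaced,
    which makes the loop easy to exercise without touching the disks.
    """
    if sync is None:
        sync = os.sync  # asks the kernel to schedule dirty buffers for writing (Unix only)
    count = 0
    while iterations is None or count < iterations:
        sync()
        sleep(jittered_interval(base, jitter))
        count += 1
    return count
```

Calling sync_loop() with no arguments runs until killed, much as a background sh loop around sync(8) would. As noted above, this kind of loop limits how much unwritten file data can be lost on a power failure but does not prevent the hangs.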
kre