Current-Users archive
Re: building netbsd-9 2 'sync' processes stuck in 'tstile'
On Sat, 8 May 2021, Robert Elz wrote:
> | I just ran a full forced 'fsck -yf' on it just prior to these events.
> | That was prompted by CVS failing to clean up a directory.
>
> That seems like an unusual response, using fsck to fix things (I assume
> on an unmounted filesystem, otherwise it is definitely wrong) isn't typically
> needed - that is required after the system has crashed, possibly
> leaving unsaved updates, which need to be repaired (made consistent
> at least). But as long as the system is still running, nothing is
> lost, and the filesystems should all be fine (if not there are far more
> serious problems - booting after an unclean shutdown without having done
> a fsck can get you into that kind of situation).
In this case, there is a directory, but when CVS tries to delete it, it
reports "Could not delete <some directory>: no such file or directory"
and aborts the update. Re-running the update fails the same way. Trying
to delete the directory manually produces the same result. The filesystem
always reports being clean, but 'fsck -yf' always finds problems with the
file or directory in question, usually missing "." and/or ".." for
directories, sometimes an impossibly large block number.
I wait until the system is quiescent and/or clients have finished or
reached a convenient stopping point, reboot single user, manually bring
up the RAID, check parity and then run 'fsck -yf' on everything, just
to be sure, then reboot again.
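
As a rough illustration only, the single-user sequence described above
might look something like the sketch below. The device name (raid0), the
config file path (/etc/raid0.conf) and the filesystem list are
hypothetical placeholders, not the actual names on this machine, and the
script defaults to a dry run that only prints the commands it would
execute.

```shell
#!/bin/sh
# Sketch of the single-user check sequence; all names are assumptions.
RAID_DEV="${RAID_DEV:-raid0}"               # hypothetical RAID device
RAID_CONF="${RAID_CONF:-/etc/raid0.conf}"   # hypothetical config file
FILESYSTEMS="${FILESYSTEMS:-/dev/rraid0a}"  # hypothetical filesystem list
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_all() {
    # Manually configure the RAID set (it is not autoconfigured here).
    run raidctl -c "$RAID_CONF" "$RAID_DEV" || return 1
    # Check parity status and rewrite it if it is not known to be good.
    run raidctl -P "$RAID_DEV" || return 1
    # Force a full check of every filesystem, answering yes to repairs.
    # fsck may exit non-zero after making repairs, so keep going.
    for fs in $FILESYSTEMS; do
        run fsck -yf "$fs"
    done
    # The final reboot is left to the operator.
    return 0
}
```

Calling check_all with the defaults just lists the three commands;
setting DRY_RUN=0 (as root, in single-user mode) would actually run them.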
> | I get those
> | from time to time after the near-catastrophic events that prompted
> | kern/55115. I used to get them frequently. Now they are less common.
> | The carnage might still have caught the build this time.
>
> First, that PR is apparently fixed now, right? It is still waiting for
> feedback from you to confirm that.
I'm waiting for my clients' tasks to finish so I can reboot the machine,
test with a -current kernel containing the fix and if successful request
pullup to netbsd-9.
> If the disk controller is still not working properly, then almost anything
> is possible. If it is, then provided everything looks clean to fsck, there
> should be nothing which would trigger a kernel locking problem - those tend
> to be more caused by internal race conditions (sometimes by little used error
> paths forgetting to release a semaphore).
It's not that the controller is malfunctioning, per se, but that when
I rebooted the machine with a kernel after MSI was enabled for siisata(4),
this controller couldn't cope with that and my then-autoconfigured RAID
got hosed. I recovered using 'raidctl -C' to force configuration,
rebuild parity and fix the filesystem, but there has been lingering
damage from that event that I've been cleaning up ever since. As I
said, these problems used to happen more frequently, but as more and
more blocks get allocated, new allocations occasionally stray into areas
that still have problems.
Before I reboot I'll see about getting a backtrace on the stuck processes
as suggested by Greg Woods.
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]consolidated[flyspeck]net OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645