Current-Users archive

Re: building netbsd-9 2 'sync' processes stuck in 'tstile'



    Date:        Sat, 8 May 2021 09:01:18 -0500 (CDT)
    From:        "John D. Baker" <jdbaker%consolidated.net@localhost>
    Message-ID:  <Pine.NEB.4.64.2105080842120.1246%spike.technoskunk.fur@localhost>


  | I wait until the system is quiescent and/or clients have finished or
  | reached a convenient stopping point, reboot single user, manually bring
  | up the RAID, check parity and then run 'fsck -yf' on everything, just
  | to be sure, then reboot again.

Actually, since it looks as if your raidframe parity maps might be
scrambled, what I'd do (assuming this is raid1, raid5 gets messier)
is back up everything that needs saving, fail one raid component,
back up again (using just the remaining half of the mirror), and then
shut down and fsck, using just the one half of the mirror (with the
other side still failed).   Once you get a clean fsck out of that,
re-add the other raid component as a new device, and let raidframe
copy everything to it.

If you have lots of patience, you could take the underlying components
(when things are quiet, and raidframe disabled) and cmp them, ignoring
the raidframe header, then copy from one drive to the other any blocks
that show up as being different.
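
For what it's worth, the compare step could be sketched roughly as below.
This is only an illustration, not a tested recovery tool: the function name
differing_blocks, the device paths in the usage comment, and in particular
the 64-sector (32 KiB) skip for the raidframe header are all assumptions, so
check the real size of the reserved area on your system before trusting the
offsets it reports.  It only reads, and lists the offsets of blocks that
differ; deciding which side's copy is the good one is still up to you.

```python
def differing_blocks(path_a, path_b, skip_bytes=64 * 512, block_size=64 * 1024):
    """Yield byte offsets of blocks that differ between two components.

    skip_bytes is the region at the start of each component to ignore.
    The default (64 sectors of 512 bytes) is an ASSUMPTION about the size
    of the raidframe header area, not a value taken from the original post.
    """
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        # Start both readers just past the (assumed) header area.
        a.seek(skip_bytes)
        b.seek(skip_bytes)
        offset = skip_bytes
        while True:
            buf_a = a.read(block_size)
            buf_b = b.read(block_size)
            if not buf_a and not buf_b:
                break  # both components exhausted
            if buf_a != buf_b:
                # Also reports a trailing block if the sizes differ.
                yield offset
            offset += block_size


# Hypothetical usage (paths are placeholders, run with the raid unconfigured):
#
#   for off in differing_blocks("/dev/rwd0e", "/dev/rwd1e"):
#       print(off)
```

Comparing in fairly large blocks keeps the output manageable; a smaller
block_size narrows down exactly where the two halves disagree at the cost of
a longer list.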

  | I'm waiting for my clients' tasks to finish so I can reboot the machine,
  | test with a -current kernel containing the fix and if successful request
  | pullup to netbsd-9.

OK.

  | but there has been lingering
  | damage from that event that I've been cleaning up ever since.

If you keep trying to fix it piece by piece, it will take forever and
you'll never be sure you have it all done.   Better to simply force the
raid to be correct, rather than try and patch around it.   You're not
going to know which component has more good data than the other, so there's
no way to really know which to fail - which is why doing backups first
is a good idea (if you can, backup from each component as if it were the
whole thing, that would give the best chance of saving all of the data,
but would be ugly to put back together).

  | Before I reboot I'll see about getting a backtrace on the stuck processes
  | as suggested by Greg Woods.

That might help, particularly if it turns out that there is something other
than filesystem (likely vnode or mount point) involved in this - but given it
is sync that is hanging, that's not all that likely.

You can do that now using crash(8) though, those processes are going nowhere,
they're nice and stable, so this is one time when crash(8) on a running
system is very likely to give meaningful results.   Just don't use -w with
crash and it will do no harm to anything at all (except perhaps whoever is
trying to analyse the results - and that's just mental anguish).

kre


