tech-kern archive
Re: Processes getting stuck in "fstchg" with NetBSD-10.99.12/amd64
> Date: Sat, 13 Sep 2025 01:35:39 -0700
> From: Brian Buhrow <buhrow%nfbcal.org@localhost>
>
> First, while I can't 100% reliably reproduce the issue, I can
> achieve the state fairly reliably by closing ssh sessions, without
> explicitly logging out, into the affected machine when connected
> through a particular Juniper firewall on our network. What appears
> to happen is I close the session and one of my csh processes gets
> stuck in specio wait, causing the root filesystem to be in suspended
> state.
Something is probably going wrong in revoking the pty in ssh -- that
is, some process still has the pty open and probably in the middle of
an I/O operation, and revoke(2) has failed to wrest it from them by
cancelling the I/O operation.
Do you have ptyfs mounted, or are you using legacy pty device nodes?
If you don't have ptyfs mounted, you should -- legacy pty device nodes
cause all kinds of trouble, and mounting ptyfs might work around this
symptom. However, legacy pty device nodes shouldn't cause _this_ kind
of trouble, so there is still a bug to investigate. Using ptyfs might
just mean you'll have unbounded growth of ptys -- but that might be
easier to investigate, since it won't cause everything to hang!
> Then, cron starts firing off jobs, each of which gets stuck in
> fstchg state until the process table gets full.
This is just compounding the underlying problem.
> Using ddb, I was able to gather the below information. I have more
> data than is shown here, but I don't have a full crash dump.
>
> Running call fstrans_dump(1) I see:
>
> [ 306390.6288439] Fstrans state by mount:
> [ 306390.6288439] / owner 0xffffa6c9fb1c1c00 state suspended
>
> Then,
>
> 17174 17174 3 1 0 ffffa6c9fb1c1c00 csh specio
>
> Then,
>
> [ 306390.6288439] 17174.17 @0xffffa6ca2639e400 (/) shared 2 cow 0 alias 0
>
> Questions:
>
> I'm assuming it's bad to have the / filesystem be in suspended
> state?
It is not bad for / to be suspended, but it shouldn't remain
indefinitely.
Suspending the file system temporarily is a necessary part of updating
mount options (mount -u), for example.
It is also a necessary part of revoking a vnode such as a tty, which
openpty(3) and getty(8) do when you log in to ensure no past processes
can still be using it.
> What does the 2 represent after the word shared in the
> previous line?
This means that the thread in question has two nested
fstrans_start(mp) calls (the default type is `shared'; if it were
fstrans_start_lazy you would see `lazy' instead).
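
To illustrate what that nesting count means, here is a toy model in
Python. It is only a sketch, not the kernel's implementation: the class
and method names are hypothetical, and it assumes nothing more than a
per-thread counter that goes up on each nested start call and down on
each matching done call.

```python
# Toy model only: the names below are hypothetical and are not the
# NetBSD kernel's data structures or functions.
class ThreadTransState:
    """Per-thread, per-mount nesting counter for shared transactions."""

    def __init__(self):
        # Depth of nested start calls, analogous to the "shared N" field.
        self.shared = 0

    def start(self):
        # Each nested start on the same mount deepens the count.
        self.shared += 1

    def done(self):
        # Each done undoes exactly one start.
        if self.shared == 0:
            raise RuntimeError("done() without matching start()")
        self.shared -= 1

    def in_transaction(self):
        # The thread is inside a transaction until every start is undone.
        return self.shared > 0


state = ThreadTransState()
state.start()  # outer call
state.start()  # nested call on the same mount
print(f"shared {state.shared}")
state.done()
state.done()
print(state.in_transaction())
```

In this model a thread showing `shared 2' is two levels deep, and it is
no longer in a transaction on that mount only after both levels have
been unwound.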
> Assuming I can get another crash, what details should I gather
> beyond these details the next time?
    ps
    ps/w
    show all tstiles

For any thread of interest like the csh process ffffa6c9fb1c1c00
listed in fstrans_dump above, do

    bt/a ffffa6c9fb1c1c00

to see what they were up to.
However, that csh process is unlikely to be the culprit: it's actually
waiting for another thread somewhere which still has the vnode open,
and revoke has failed to force it to close that vnode.
So you need to browse through ps to find who still has it open, maybe
using `show files <structprocptr>' for each <structprocptr> you can
find (use `show proc <pid>' or `show lwp <structlwpaddr>' to find
<structprocptr> from `ps' output).
You can probably safely narrow your search to children of your shell
process.