NetBSD-Bugs archive
Re: kern/50627: filemon can hang the process
The following reply was made to PR kern/50627; it has been noted by GNATS.
From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: kern/50627: filemon can hang the process
Date: Wed, 06 Jan 2016 13:35:40 +0700
Date: Wed, 6 Jan 2016 01:45:00 +0000 (UTC)
From: paul%whooppee.com@localhost
Message-ID: <20160106014500.65AA47A21B%mollari.NetBSD.org@localhost>
| However, you can work-around the problem by using an
| atexit() handler to close the filemon device's fd first
In-kernel atexit() could solve this problem, and could also potentially
have uses for other problems, but when things get complex (I don't know
if it is possible, but consider a process using filemon twice, where one
of them is logging changes to a file monitored by the other) the atexit()
order processing is just going to be the same problem all over again -
something will need to make sure those get done in the correct order,
which is not necessarily either establishment order, or the reverse of that.
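For concreteness, the quoted userland workaround might look like the sketch below. FILEMON_SET_FD is filemon's real ioctl, but everything else here (the helper names, the error handling, the #ifdef guard for systems without the filemon header) is illustrative assumption, not code from the PR:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>

static int filemon_fd = -1;

/* atexit() handler: drop filemon's fd while the process is still
 * running normally, so exit-time cleanup never sees the extra
 * reference on the log fd that causes the hang. */
static void close_filemon(void) {
    if (filemon_fd != -1) {
        close(filemon_fd);
        filemon_fd = -1;
    }
}

/* Open filemon and point its log at log_fd; returns -1 when the
 * device (or the filemon header) is not available. */
int setup_filemon(int log_fd) {
    filemon_fd = open("/dev/filemon", O_RDWR);
    if (filemon_fd == -1)
        return -1;
#ifdef FILEMON_SET_FD
    if (ioctl(filemon_fd, FILEMON_SET_FD, &log_fd) == -1) {
        close(filemon_fd);
        filemon_fd = -1;
        return -1;
    }
#else
    (void)log_fd;   /* no filemon header on this system */
#endif
    atexit(close_filemon);
    return filemon_fd;
}
```

The point of the pattern is only that the handler runs before the kernel's descriptor-table teardown, which is exactly why its ordering relative to other handlers matters once there is more than one.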
Since what you have here is a "do it in this order" problem, it seems as
if one of the classic solutions to that problem might be more in order.
One would be to use explicit dependencies - that is what I thought of
first (but...) In your case this would work like
1. Process opens /dev/filemon and gets fd #3
2. Process tells filemon to log activity to fd #1 (stdout)
2a. fd #1 now depends upon fd#3
3. Process calls sys_exit(), which starts process cleanup
4. Clean-up code tries to fd_close all open descriptors, in
order, so handles fd #0 and then fd #1
4a. fd#1 depends upon fd#3, so close fd#3 first, which releases
the ref on fd#1, then return to:
 5. (as the scenario previously went) fd #1 has another reference,
    so we wait on the condvar, which never gets broadcast since
    there's no other thread to run. We hang here forever.
That step 5 is no longer true: the other reference has gone away, and the
close of fd#1 succeeds normally. After going past fd #2 we reach fd #3
but now that is just an (old) closed descriptor for which there is
nothing to do (just like 4 5 ... presumably)
The problem with this is that it (probably) needs a doubly linked list
between fds to work properly (whether the links are pointers or just fd
numbers doesn't matter), and then the locking to maintain it all properly
at close (setup is easy) would get difficult at best, I think.
An alternative is just to (separately) maintain a correctly ordered
close list, and always close in that order. For most processes (those
not using filemon, or anything else with a similar requirement) this
list would be null, and the closes would just happen in numeric order, as now.
But when filemon (or anything similar) is in use, the process would be ...
1. Process opens /dev/filemon and gets fd #3
2. Process tells filemon to log activity to fd #1 (stdout)
2a. process notes that this is its first activity which requires
ordered closes, and so allocates a close order list, and links that to
the process. (This would be an array of size max_fd, of element type
some int big enough to hold a fd - char, short, whatever it takes).
This is init'd with values in ascending integer order (so of itself,
would change nothing).
2b. logging to fd1 makes fd1 depend upon fd3, so fd3 must be closed
before 1, adjust the order of the list so 3 is inserted ahead of 1
(this is a slice and reassemble operation on the array.)
Alt: move 1 after 3.
Which to do depends on whether anything previously depended upon 1 or 3
or whether either of those already depends upon anything else - setting
a flag (on both fds) to indicate an ordering dependency exists, would work,
then error out of this (ie: filemon ioctl fails) if both relevant fds
already have the flag set, otherwise if one does, move the position of
the other (if neither do, toss a coin .. or follow a rule in the code
more likely) - set the flag on both fds when done.
These flag bits force the (relative) order of those fds to remain
unchanged, so once a dependency is set it cannot be altered later
(unless an fd is closed, of course; that resets the flag for a later open.)
3. Process calls sys_exit(), which starts process cleanup
3a. process notices that it has a close order list, so it calls
fd_close() on all open descriptors in the order that they appear
on that list, which in this case would be 0 3 1 2 ...
(or perhaps 0 2 3 1 ... it should make no difference as long as
3 gets closed before 1).
3b. skip to step 6 (4 is done already, and 5 never happens)
4. Clean-up code tries to fd_close all open descriptors, in
order, so handles fd #0 and then fd #1
5. fd #1 has another reference, so we wait on the condvar,
which never gets broadcast since there's no other thread
to run. We hang here forever.
There are no locking issues of note with this one (locks are needed, but
they're nothing unusual, and probably already covered by existing locking)
Step 2a could be optimised (for space) slightly by making the array only
as big as the number of currently open fds (it would then need to be
extendable should that be needed later - unlikely, but possible). In that
case, after 3a we would just continue on to step 4: the initial group
of fds (those open at the time of 2a) would already be closed, and so just
skipped at step 4, which would then close any fds opened later.
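The step-2b "slice and reassemble" reorder on the close-order array might look like this sketch (function names are hypothetical; a real version would sit behind the existing descriptor-table locking, and the dependency-flag conflict checks are omitted):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical close-order list: order[] holds fd numbers in the
 * sequence they should be closed at exit; initially 0, 1, 2, ... */
void closeorder_init(int *order, int nfds) {
    for (int i = 0; i < nfds; i++)
        order[i] = i;
}

/* Record "close `first` before `later`".  If `first` currently sits
 * after `later`, slice it out of the array and reinsert it just
 * ahead of `later` (the step-2b operation). */
void closeorder_depend(int *order, int nfds, int first, int later) {
    int ifirst = -1, ilater = -1;
    for (int i = 0; i < nfds; i++) {
        if (order[i] == first) ifirst = i;
        if (order[i] == later) ilater = i;
    }
    assert(ifirst >= 0 && ilater >= 0);
    if (ifirst < ilater)
        return;                 /* already in the required order */
    /* slide the entries between them up one slot, then drop
     * `first` into the vacated position ahead of `later` */
    memmove(&order[ilater + 1], &order[ilater],
        (size_t)(ifirst - ilater) * sizeof(order[0]));
    order[ilater] = first;
}
```

Starting from {0, 1, 2, 3}, recording "close 3 before 1" yields {0, 3, 1, 2} - the exit order from step 3a.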
There are most likely other solutions worth investigating before even
contemplating the mechanism of adding some kind of atexit() (for processes,
rather than the kernel itself, of course) to the kernel.
kre