Re: failure in NetBSD while running as root
+ martin, pooka, who have seen this periodically
On Tue, Mar 29, 2011 at 7:42 PM, Julio Merino <jmmv%netbsd.org@localhost> wrote:
> On Tue, Mar 29, 2011 at 7:37 PM, Jeff Rizzo <riz%netbsd.org@localhost> wrote:
>> atf-run: ERROR: XXX: Cannot get information of /tmp/atf-run.14884b/mnt;
>> atf-run: lstat(2) failed: Device not configured
>> g4:riz /usr/tests/fs/psshfs>
>> I have actually removed the umount, and while I don't get the error, it just
>> fails sometimes.
> Aha! I could eventually get it to fail here with "Device busy",
> although it surely is the same issue. It smells like race condition.
> I'll take a look.
Alright. I know what's happening. The offending test case is
mounting a rump file system and it is running a daemon in the
background that creates a pid file in the work directory. During the
test case cleanup, the file system is unmounted and the server is
killed. And here comes the race condition: neither the unmounting of
the puffs file system nor the termination of the server (with the
accompanying removal of its pid file) is synchronous.
When atf-run attempts to clean up the work directory, it scans a
still-changing file system and bad things happen. For example, it may
enumerate the directory contents first and, later, when attempting to
delete a supposedly-existing file, get an ENOENT. Or it may try to
enumerate the contents of a mount point at the same time as the puffs
server process is exiting.
I have committed revision 648ed6360b2b7cda81a6079b00dc436d09c745b8
which implements a workaround for this situation: the "fix" is to
either retry failing file system operations a few times in an attempt
to allow the work directory to stabilize, or to ignore
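
To make the retry idea concrete, here is a rough sketch in Python. It is
not the code from the commit (atf-run is not written in Python), and the
names retry() and remove_workdir(), the attempt count and the delay are
all made up for illustration. Treating a work directory that has already
vanished as "done" is this sketch's own choice.

```python
import os
import shutil
import time


def retry(op, attempts=5, delay=0.1):
    """Run op(); on OSError, wait and try again, up to `attempts` times.

    Hypothetical helper: the attempt count and delay are illustrative
    values, not the ones used by atf-run.
    """
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except OSError:
            if attempt == attempts:
                raise  # the directory never stabilized; report the error
            time.sleep(delay)  # give the unmount / daemon exit time to finish


def remove_workdir(path, attempts=5, delay=0.1):
    """Remove a work directory that may still be changing underneath us."""

    def remove_once():
        try:
            shutil.rmtree(path)
        except FileNotFoundError:
            # Assumption of this sketch: if the whole tree is gone there is
            # nothing left to do. If only some entry vanished mid-scan
            # (e.g. the pid file), re-raise so retry() rescans the tree.
            if os.path.lexists(path):
                raise

    retry(remove_once, attempts, delay)
```

A caller would simply run remove_workdir("/tmp/atf-run.XXXXXX/") on the
work directory (path made up here); errors such as "Device busy" or
"Device not configured" surface as OSError and are retried until the
attempts run out.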
Now, this feels like a very ugly hack but I'm not sure how we'd do
better. For file systems, I see that fuse has a "sync_unmount"
command/flag (dunno what exactly it is) that was added for exactly
this purpose. For daemons... we can't control their termination
because they are outside of the process group; the test case could
either kill -9 the server or explicitly remove the pid file before
exiting, but these are even worse workarounds. I'm open to ideas.
Martin, Jeff: I have pulled this change into NetBSD's src. Could you
give it a try? I'd like to release atf-0.13 RSN (like today or
tomorrow ;-) and it'd be great if this issue was gone once and for
all. If the patch does not fix the problem, the following should
trigger the failure pretty quickly:
# cd /usr/tests/fs/psshfs ; while atf-run t_psshfs; do :; done
Julio Merino / @jmmv