ATF-devel archive


Re: failure in NetBSD while running as root



+ martin, pooka, who have seen this periodically

On Tue, Mar 29, 2011 at 7:42 PM, Julio Merino <jmmv%netbsd.org@localhost> wrote:
> On Tue, Mar 29, 2011 at 7:37 PM, Jeff Rizzo <riz%netbsd.org@localhost> wrote:
>> atf-run: ERROR: XXX: Cannot get information of /tmp/atf-run.14884b/mnt;
>> atf-run: lstat(2) failed: Device not configured
>> g4:riz  /usr/tests/fs/psshfs>
>>
>>
>> I have actually removed the umount, and while I don't get the error, it just
>> fails sometimes.
>
> Aha!  I could eventually get it to fail here with "Device busy",
> although it surely is the same issue.  It smells like race condition.
>
> I'll take a look.

Alright.  I know what's happening.  The offending test case is
mounting a rump file system and it is running a daemon in the
background that creates a pid file in the work directory.  During the
test case cleanup, the file system is unmounted and the server is
killed.  And here comes the race condition: neither the unmounting of
the puffs file system nor the termination of the server (with the
accompanying removal of its pid file) is synchronous.

When atf-run attempts to clean up the work directory, it scans a
still-changing file system and bad things happen.  For example, it may
enumerate the directory contents first and, later, when attempting to
delete a supposedly-existing file, get an ENOENT.  Or it may try to
enumerate the contents of a mount point at the same time as the puffs
server process is exiting.
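
The first failure mode can be reproduced in isolation.  What follows is
only a minimal Python sketch, not code from atf-run: the pid file name
is hypothetical, and the daemon's asynchronous exit is simulated by
removing the file directly between the listing and the deletion.

```python
import errno
import os
import tempfile

def stale_listing_error():
    """Return the errno seen when deleting a file that vanished after listing."""
    with tempfile.TemporaryDirectory() as workdir:
        pidfile = os.path.join(workdir, "server.pid")  # hypothetical name
        with open(pidfile, "w") as f:
            f.write("12345\n")

        # Step 1: the cleanup code snapshots the directory contents.
        entries = os.listdir(workdir)

        # Step 2: meanwhile the daemon exits and removes its own pid file
        # (simulated here by removing it directly).
        os.unlink(pidfile)

        # Step 3: the cleanup code acts on its now-stale snapshot.
        try:
            for name in entries:
                os.unlink(os.path.join(workdir, name))
        except OSError as e:
            return e.errno
    return None
```

Calling stale_listing_error() returns errno.ENOENT: the snapshot taken
in step 1 no longer matches the directory by the time step 3 runs.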

I have committed revision 648ed6360b2b7cda81a6079b00dc436d09c745b8
which implements a workaround for this situation: the "fix" is to
either retry failing file system operations a few times in an attempt
to allow the work directory to stabilize, or to ignore
supposedly-transient errors.
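
To make the idea concrete, here is a rough Python sketch of the
retry-and-ignore approach.  It is not the code from the commit above:
the function name, the number of attempts, the delay and the set of
errors treated as transient are all assumptions chosen for
illustration.

```python
import errno
import os
import shutil
import time

# Errors treated as transient while the work directory settles.
# This set is an assumption, not taken from the actual change.
TRANSIENT = {errno.ENOENT, errno.EBUSY, errno.ENXIO, errno.ENOTEMPTY}

def cleanup_workdir(path, attempts=5, delay=0.1):
    """Remove path recursively, retrying while the tree is still changing.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            shutil.rmtree(path)
            return attempt
        except OSError as e:
            if e.errno == errno.ENOENT and not os.path.lexists(path):
                # The whole directory is already gone; nothing left to do.
                return attempt
            if e.errno not in TRANSIENT or attempt == attempts:
                # Either a real error or we ran out of retries.
                raise
            # Give the unmount and the daemon a moment to finish.
            time.sleep(delay)
```

The retry bounds the wait instead of looping forever, so a work
directory that never stabilizes still surfaces as an error.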

Now, this feels like a very ugly hack, but I'm not sure how we'd do
better.  For file systems, I see that fuse has a "sync_unmount"
command/flag (I don't know exactly what it is) that was added for
exactly this purpose.  For daemons... we can't control their termination
because they are outside of the process group; the test case could
either kill -9 the server or explicitly remove the pid file before
exiting, but these are even worse workarounds.  I'm open to ideas.

Martin, Jeff: I have pulled this change into NetBSD's src.  Could you
give it a try?  I'd like to release atf-0.13 RSN (like today or
tomorrow ;-) and it'd be great if this issue was gone once and for
all.  If the patch does not fix the problem, the following should
trigger the failure pretty quickly:

# cd /usr/tests/fs/psshfs ; while atf-run t_psshfs; do :; done

Thanks,

-- 
Julio Merino / @jmmv

