ATF-devel archive


Re: failure in NetBSD while running as root



On Wed Mar 30 2011 at 12:21:59 +0100, Julio Merino wrote:
> + martin, pooka, who have seen this periodically
> 
> On Tue, Mar 29, 2011 at 7:42 PM, Julio Merino <jmmv%netbsd.org@localhost> wrote:
> > On Tue, Mar 29, 2011 at 7:37 PM, Jeff Rizzo <riz%netbsd.org@localhost> wrote:
> >> atf-run: ERROR: XXX: Cannot get information of /tmp/atf-run.14884b/mnt;
> >> atf-run: lstat(2) failed: Device not configured
> >> g4:riz  /usr/tests/fs/psshfs>
> >>
> >>
> >> I have actually removed the umount, and while I don't get the error, it
> >> just fails sometimes.
> >
> > Aha!  I could eventually get it to fail here with "Device busy",
> > although it surely is the same issue.  It smells like a race condition.
> >
> > I'll take a look.
> 
> Alright.  I know what's happening.  The offending test case is
> mounting a rump file system and it is running a daemon in the

rump??

> background that creates a pid file in the work directory.  During the
> test case cleanup, the file system is unmounted and the server is
> killed.  And here comes the race condition: neither unmounting the
> puffs file system nor the termination of the server (with the
> accompanying removal of its pid file) is synchronous.
> 
> When atf-run attempts to do the work directory clean up, it scans a
> still-changing file system and bad things happen.  For example, it may
> enumerate the directory contents first and, later, when attempting to
> delete a supposedly-existing file, get an ENOENT.  Or it may try to
> enumerate the contents of a mount point at the same time as the puffs
> server process is exiting.
> 
> I have committed revision 648ed6360b2b7cda81a6079b00dc436d09c745b8
> which implements a workaround for this situation: the "fix" is to
> either retry failing file system operations a few times in an attempt
> to allow the work directory to stabilize, or to ignore
> supposedly-transient errors.
> 
> Now, this feels like a very ugly hack but I'm not sure how we'd do
> better.  For file systems, I see that fuse has a "sync_unmount"
> command/flag (dunno what exactly it is) that was added for exactly
> this purpose.  For daemons... we can't control their termination

unmount (i.e. umount /mnt/path) *is* synchronous.  However, in case
the server exits (via signal, crash, or whatever) there is a window of
limbo during which the file system is being unmounted from the kernel.
Since signals happen asynchronously, I can't see how it's possible to
provide anything synchronous against that.
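
For illustration, the retry-or-ignore workaround described in the quoted
message could look roughly like the Python sketch below.  It is not the code
from revision 648ed6360b2b7cda81a6079b00dc436d09c745b8; the function name
remove_work_dir, the retry count, the delay, and the set of errno values
treated as transient are all assumptions made for this example.

```python
import errno
import os
import shutil
import time

# Assumption: which errors count as transient.  On NetBSD, ENXIO is reported
# as "Device not configured" and EBUSY as "Device busy", the two messages
# seen in this thread.  ENOTEMPTY covers a file appearing during removal.
TRANSIENT = {errno.ENXIO, errno.EBUSY, errno.ENOTEMPTY}


def remove_work_dir(path, attempts=5, delay=0.1, sleep=time.sleep):
    """Remove path recursively, retrying while the tree is still changing.

    Returns True once the directory is gone, False if every attempt failed.
    Errors outside TRANSIENT (other than ENOENT) are re-raised.
    """
    for _ in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            # An entry vanished between listing and deletion (ENOENT).
            # If the root itself is gone, there is nothing left to do;
            # otherwise fall through and rescan on the next attempt.
            if not os.path.lexists(path):
                return True
        except OSError as e:
            if e.errno not in TRANSIENT:
                raise
        # Give the unmount and the daemon's exit some time to settle.
        sleep(delay)
    return False
```

A caller would treat a False return as a cleanup failure.  As noted above,
this only narrows the window; it does not make the server's exit synchronous.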

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa

