tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: fd code multithreaded race?



On Wed Aug 04 2010 at 13:21:07 +0000, Andrew Doran wrote:
> On Sat, Jul 31, 2010 at 08:31:19PM +0300, Antti Kantee wrote:
> > Hi,
> > 
> > I'm looking at a KASSERT which is triggering quite rarely for me (in
> > terms of iterations):
> > 
> > panic: kernel diagnostic assertion "dt->dt_ff[i]->ff_refcnt == 0" failed: 
> > file 
> > "/usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c", 
> > line 856
> > 
> > Upon closer examination, it seems that this can trigger while another
> > thread is in fd_getfile() between upping the refcount, testing for
> > ff_file, and fd_putfile().  Removing the KASSERT seems to restore correct
> 
> You're right there, the KASSERT() is wrong, it should be removed.

Thanks, I'll do that.

> > operation, but I didn't read the code far enough to see where the race
> > is actually handled and what stops the code from using the wrong file.
> 
> FYI the fdfile_t (per-descriptor records) are stable for the lifetime of the
> process, what each record descibes can and does of course change, and how
> those records are pointed to does change (fdtab_t).
>  
> There isn't really a concept of "wrong file", as in, the app gets
> what it asked for.  It is free to ask for the wrong thing, and it's free
> to ask for the right thing at the wrong time, etc - that's its problem.
> 
> Unless you're alluding to another bug?

Not really.  I just started thinking about how applications can make
sure they use the right file descriptor.  It seems using close() to
notify other threads of a file descriptor being closed is racy.

So something naiive like this:

t1: lock
t1: get fd1
t1: unlock
/* t1 wants to do a syscall with fd1 but is preempted */
t2: lock
t2: close fd1
t2: unlock
t3: lock
t3: open, result fd1
t3: unlock
t1: syscall fd1 ...

will give you the wrong result.  Essentially there is no interlock from
the application lookup to the kernel backing object lookup.

So I guess if you want things to work correctly, instead of close()
you need to dup2() to a zombie/"deadfs" fd and wait for all threads to
check in before you can close it.  (i assume dup2 is atomic)

Never realized file descriptors and threads were so tricky ;)


Home | Main Index | Thread Index | Old Index