tech-kern: Re: FS layering bug?

Subject: Re: FS layering bug?
To: Bill Studenmund <wrstuden@netbsd.org>
From: Marton Fabo <morton@eik.bme.hu>
List: tech-kern
Date: 01/26/2004 20:11:01
Bill Studenmund wrote:

[sorry for quoting all this, but this is still relevant for the current 
post]
>>I have a 1.6ZC i386 system built from -current sources from last 
>>October, with the default FS layout of / and /usr as two FSes.
>>
>>I have tried the following: I copied all the contents of the / 
>>filesystem to /var/tmp/chroot, then I null-mounted /usr on 
>>/var/tmp/chroot/usr read-only, and then I union-mounted an empty 
>>/var/tmp/chroot/chroot-usr directory over /var/tmp/chroot/usr. This is 
>>so that I have a full writeable replica of my system in /var/tmp/chroot, 
>>with only having to duplicate the contents of the root FS.
>>
>>Now I chrooted to the above dir, started to play around, everything 
>>worked like a charm, I could modify it without endangering anything in 
>>my real system. But after some time, the kernel panicked with the 
>>message "locking against myself".
> 
> 
> Oops!
> 
> 
>>The question is whether 1) this is an inherent, predictable crash in the 
>>above sketched scenario, or a bug; 2) is it fixed in current -current; 
>>and 3) if the answer is "no" to both of the previous questions, what may 
>>possibly be done to fix it.
> 
> 
> Probably a bug. I doubt it's been fixed lately. The most important thing 
> to get is the stack trace of the locking against myself panic.

How do I do it? I'm not really into kernel debugging. A pointer will be 
enough... Anyway, will I have to write the stack trace down on paper, or 
is there a way to save the ddb (or however the kernel debugger is 
called) session's log?

> Wait, where is the empty directory coming from?

As I wrote earlier, the unioned empty directory was 
/var/tmp/chroot/chroot-usr, so it was a regular directory in the 
chrooted environment's root dir.

> Also, why not just use mount -t union -o -b /usr /var/chroot/usr ?

Yes, now that I checked mount_union, that would have been an option. In 
fact, I'll check if that is good for me. I just overlooked it because my 
initial intent was to "replicate the filesystem read-only, and then 
overlay an empty writable layer"...

Anyway, it's still interesting why the former solution confused and 
crashed the kernel.

>>PS.: Earlier I also tried null-mounting / on /var/tmp/chroot, using a 
>>modified /sbin/mount_null with the check for distinct directories 
>>disabled. It resulted in the same locking panic, but I attributed that 
>>crash to mount_null having a valid reason not to allow mounting 
>>non-distinct directories over each other by default. Now it has also 
>>become an open question whether null-mounting /a/b over /a/b/c/d is 
>>expected to cause a locking error and kernel panic.
> 
> 
> That null mounting (/a/b over /a/b/c/d) will lead to a kernel panic.  
> That's why the test is there. Directories are locked from root outward. So
> you lock /, then /a, then /a/b, then /a/b/c, then /a/b/c/d. The problem
> with that null mount is that /a/b and /a/b/c/d will end up having the same
> lock. So when you lock /a/b, you also lock /a/b/c/d. Consider two
> processes looking up the path name "/a/b/c/d". One of them (call it #1)  
> has gotten to /a/b/c and is looking up "d". It has /a/b/c locked at that
> point. The other one (#2) comes along, and gets to /a/b looking for "c".  
> It ends up waiting for #1 to release its lock on /a/b/c, and it has /a/b 
> locked in the meantime. Due to layering, it also has /a/b/c/d locked. #1 
> now waits for the lock on "d", but it will never get released. The 
> kernel's now deadlocked.

That sounds quite logical. Can't this be fixed by some clever mechanism 
however, other than simply disallowing the scenario?
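
To make sure I understood the explanation, here is a minimal sketch of it 
as a wait-for graph. It is a toy model, not kernel code: the function 
names, the "aliases" table (which path shares its lock with which other 
path) and the process labels are all made up for illustration.

```python
def lock_of(path, aliases):
    """Return the identity of the lock that protects `path`.

    `aliases` is a hypothetical table: a layered directory maps to the
    lower directory whose lock it shares.
    """
    return aliases.get(path, path)


def find_deadlock(holds, wants, aliases):
    """Return the processes forming a wait-for cycle, or None."""
    # Who currently owns each lock?
    owner = {}
    for proc, paths in holds.items():
        for path in paths:
            owner[lock_of(path, aliases)] = proc

    # waits_for[p] is the process owning the lock that p is blocked on.
    waits_for = {}
    for proc, path in wants.items():
        holder = owner.get(lock_of(path, aliases))
        if holder is not None and holder != proc:
            waits_for[proc] = holder

    # Follow the chain from each blocked process; a return to the start
    # means a cycle, i.e. a deadlock.
    for start in waits_for:
        seen = [start]
        cur = waits_for[start]
        while cur in waits_for and cur not in seen:
            seen.append(cur)
            cur = waits_for[cur]
        if cur == start:
            return seen
    return None


# #1 holds /a/b/c and wants "d"; #2 holds /a/b and wants "c".
holds = {"#1": ["/a/b/c"], "#2": ["/a/b"]}
wants = {"#1": "/a/b/c/d", "#2": "/a/b/c"}

# Null mount of /a/b over /a/b/c/d: both paths end up with the same lock.
layered = find_deadlock(holds, wants, {"/a/b/c/d": "/a/b"})
# Without the mount, /a/b/c/d has its own lock and #1 is not blocked.
plain = find_deadlock(holds, wants, {})

print(layered)  # ['#1', '#2']
print(plain)    # None
```

With the shared lock the two lookups wait on each other; without it, #2 
merely waits until #1 is done.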

<15 minutes pass>

Actually, it looks like it really should be addressed somehow. Look at 
the following session transcript:

> [morton@gerzson:/usr/home/morton | 01/26 19:49:17]
> # mkdir tmp
> [morton@gerzson:/usr/home/morton | 01/26 19:49:17]
> # mkdir tmp1
> [morton@gerzson:/usr/home/morton | 01/26 19:49:17]
> # mkdir tmp1/tmp2
> [morton@gerzson:/usr/home/morton | 01/26 19:49:36]
> # mount -t null tmp tmp1/tmp2
> [morton@gerzson:/usr/home/morton | 01/26 19:49:44]
> # mount
> /dev/wd0a on / type ffs (local)
> /dev/wd0e on /usr type ffs (NFS exported, local)
> /usr/home/morton/tmp on /usr/home/morton/tmp1/tmp2 type null (local)
> [morton@gerzson:/usr/home/morton | 01/26 19:49:46]
> # mv tmp1 tmp3
> [morton@gerzson:/usr/home/morton | 01/26 19:50:01]
> # ll
> total 144432
> ...
> drwxr-xr-x    2 morton  users       512 Jan 26 19:47 tmp/
> drwxr-xr-x    3 morton  users       512 Jan 26 19:49 tmp3/
> [morton@gerzson:/usr/home/morton | 01/26 19:50:03]
> # mount
> /dev/wd0a on / type ffs (local)
> /dev/wd0e on /usr type ffs (NFS exported, local)
> /usr/home/morton/tmp on /usr/home/morton/tmp1/tmp2 type null (local)
> [morton@gerzson:/usr/home/morton | 01/26 19:50:05]
> # touch tmp/a
> [morton@gerzson:/usr/home/morton | 01/26 19:50:14]
> # ll tmp
> total 0
> -rw-r--r--  1 root  users  0 Jan 26 19:50 a
> [morton@gerzson:/usr/home/morton | 01/26 19:50:16]
> # ll tmp3/tmp2/
> total 0
> -rw-r--r--  1 root  users  0 Jan 26 19:50 a
> [morton@gerzson:/usr/home/morton | 01/26 19:50:24]
> # umount tmp3/tmp2
> umount: /usr/home/morton/tmp3/tmp2: not currently mounted
> [morton@gerzson:/usr/home/morton | 01/26 19:50:32]
> # umount tmp
> umount: /usr/home/morton/tmp: not currently mounted
> [morton@gerzson:/usr/home/morton | 01/26 19:50:40]
> #

Here I basically null-mounted a directory over another (distinct, in 
this case) subdirectory, and then renamed a parent of the mount point. 
The mount remained active, but I had no way to unmount it anymore. (To 
be exact, after re-renaming the parent to the original name, I could, 
but this still seems to be quite risky.)

So, I guess the full path to all mount points should be prevented from 
any modification (at least renaming). This alone still wouldn't solve my 
problem.
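
The check I have in mind could look roughly like the sketch below. It is 
only an illustration of the rule "a rename is refused if the path is a 
mount point or a parent of one"; the function name and the list of mount 
points are hypothetical, and nothing here reflects how the kernel 
actually tracks mounts.

```python
import posixpath


def blocks_rename(path, mount_points):
    """Return True if `path` is a mount point or an ancestor of one."""
    path = posixpath.normpath(path)
    # "/usr/x" must not match "/usr/xy", so compare with a trailing slash.
    prefix = path.rstrip("/") + "/"
    for mp in mount_points:
        mp = posixpath.normpath(mp)
        if mp == path or mp.startswith(prefix):
            return True
    return False


# Mount points taken from the transcript above.
mounts = ["/", "/usr", "/usr/home/morton/tmp1/tmp2"]

# tmp1 is a parent of a mount point, so "mv tmp1 tmp3" would be refused.
print(blocks_rename("/usr/home/morton/tmp1", mounts))  # True
# tmp is only the source of the null mount, not on the path to one.
print(blocks_rename("/usr/home/morton/tmp", mounts))   # False
```

Note that this only covers the path to the mount point, not the 
directory that was mounted from.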

But to allow access despite preventing modification, processes should be 
able to lock for reading, and of course any number of *readers* should 
be able to access the directories concurrently. And this would solve my 
problem too, since concurrent read-only accessors wouldn't block each 
other, and thus wouldn't cause a deadlock.
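
To illustrate what I mean by locking for reading, here is a toy 
shared/exclusive lock. It is a single-threaded model with non-blocking 
"try" operations, just to show the intended rules; the class and method 
names are invented, and I am not claiming the kernel's locks work (or 
could easily be made to work) this way.

```python
class SharedExclusiveLock:
    """Toy model: many readers or one writer, never both."""

    def __init__(self):
        self.readers = 0     # number of shared (read) holders
        self.writer = False  # True while held exclusively

    def try_shared(self):
        """A lookup passing through the directory takes a shared lock."""
        if self.writer:
            return False
        self.readers += 1
        return True

    def try_exclusive(self):
        """A rename or other modification needs the lock exclusively."""
        if self.writer or self.readers > 0:
            return False
        self.writer = True
        return True

    def release_shared(self):
        assert self.readers > 0, "no shared holder to release"
        self.readers -= 1

    def release_exclusive(self):
        assert self.writer, "not held exclusively"
        self.writer = False


lock = SharedExclusiveLock()
first = lock.try_shared()        # lookup #1 gets in
second = lock.try_shared()       # lookup #2 gets in too, no waiting
rename = lock.try_exclusive()    # a rename is refused meanwhile
lock.release_shared()
lock.release_shared()
rename_later = lock.try_exclusive()  # allowed once the readers are gone

print(first, second, rename, rename_later)  # True True False True
```

The point is only that two lookups holding the same lock for reading 
never have to wait for each other.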

mortee

PS.: Or perhaps the scenario from the transcript isn't considered a bug, 
but rather a user error?